The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training
By: Subramanyam Sahoo
Potential Business Impact:
Improves mathematical reasoning in AI models by rewarding the quality of the reasoning, not just the final answer.
Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework for studying hard (discrete), continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals over the course of training, balancing exploration against stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or purely continuous approaches, offering insights for alignment via adaptive reward modeling.
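To make the reward formulations concrete, here is a minimal Python sketch of how a hard reward, a continuous reward, and an adaptive hybrid schedule could be wired together. The function names (`hard_reward`, `continuous_reward`, `hybrid_reward`), the component weights, and the linear annealing direction are illustrative assumptions, not the paper's implementation; the abstract only states that the scheduler blends discrete and continuous signals over training.

```python
import math

def hard_reward(predicted_answer: str, gold_answer: str) -> float:
    """Discrete signal: 1.0 for an exactly correct final answer, else 0.0."""
    return 1.0 if predicted_answer.strip() == gold_answer.strip() else 0.0

def continuous_reward(correct: float, perplexity: float, reasoning_score: float,
                      consistency: float, w=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Dense signal in [0, 1] mixing correctness, fluency (inverse log-perplexity),
    reasoning quality, and self-consistency. The weights w are illustrative."""
    fluency = 1.0 / (1.0 + math.log(max(perplexity, 1.0)))
    return w[0] * correct + w[1] * fluency + w[2] * reasoning_score + w[3] * consistency

def hybrid_reward(step: int, total_steps: int, predicted_answer: str, gold_answer: str,
                  perplexity: float, reasoning_score: float, consistency: float) -> float:
    """Adaptive hybrid scheduler (assumed direction): start mostly discrete,
    then anneal linearly toward the dense continuous signal as training proceeds."""
    alpha = min(1.0, step / max(total_steps, 1))  # schedule weight in [0, 1]
    hard = hard_reward(predicted_answer, gold_answer)
    dense = continuous_reward(hard, perplexity, reasoning_score, consistency)
    return (1.0 - alpha) * hard + alpha * dense
```

In an RLHF-style training loop, a scalar like `hybrid_reward(step, ...)` would score each sampled completion in place of a single hard or continuous reward; the schedule direction and mixing weights shown here are assumptions for illustration only.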
Similar Papers
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
Computation and Language
Teaches computers to solve harder math problems.
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Artificial Intelligence
Teaches AI to follow instructions better.
Uncertainty Quantification for Large Language Model Reward Learning under Heterogeneous Human Feedback
Machine Learning (Stat)
Makes AI understand what people like better.