Rectifying LLM Thought from the Lens of Optimization
By: Junnan Liu, Hongwei Liu, Songyang Zhang, and more
Potential Business Impact:
Teaches computers to think smarter, not longer.
Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
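The dual scoring idea can be pictured with a small sketch. The Python below is a minimal, hypothetical illustration rather than the paper's actual method: it treats the per-step values of some surrogate objective as a descent trace, scores its intensity (average progress per step) and stability (fraction of steps that actually descend), and folds the composite into an RLVR-style outcome reward. All function names, scoring formulas, and weights (alpha, beta, lam) are assumptions for illustration.

```python
import numpy as np

def repro_style_reward(step_losses, alpha=0.5, beta=0.5):
    """Composite process-level reward over a chain of thought.

    step_losses: value of a surrogate objective after each reasoning
    step (e.g., an estimated distance to the final answer). The exact
    scoring formulas here are illustrative assumptions, not the
    paper's definitions.
    """
    losses = np.asarray(step_losses, dtype=float)
    deltas = losses[:-1] - losses[1:]  # per-step descent, analogous to gradient updates

    # Intensity: average progress toward the solution per reasoning step.
    intensity = deltas.sum() / max(len(deltas), 1)

    # Stability: fraction of steps that actually descend; oscillating
    # chains (overthinking, backtracking) are penalized.
    stability = (deltas > 0).mean() if len(deltas) else 0.0

    return alpha * intensity + beta * stability

def shaped_return(outcome_correct, step_losses, lam=0.1):
    """Verifiable outcome reward plus the process-level term,
    as it might enter an RLVR-style policy update."""
    outcome = 1.0 if outcome_correct else 0.0
    return outcome + lam * repro_style_reward(step_losses)

if __name__ == "__main__":
    # A chain that descends steadily scores higher than one that
    # meanders to the same correct answer.
    steady = [1.0, 0.7, 0.4, 0.1]
    meander = [1.0, 0.9, 1.1, 0.8, 0.9, 0.6, 0.7, 0.1]
    print(shaped_return(True, steady))   # ~1.065
    print(shaped_return(True, meander))  # ~1.035
```

Under these assumed scores, both chains earn the full outcome reward, but the shaped return favors the shorter, monotonic chain, which is the qualitative behavior the abstract describes for mitigating overthinking.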
Similar Papers
Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization
Computation and Language
Teaches computers to think step-by-step to solve problems.
The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
Machine Learning (CS)
Computers can lie to hide bad actions.
Revisiting LLM Reasoning via Information Bottleneck
Artificial Intelligence
Helps computers reason better on math problems.