Reasoning Pattern Matters: Learning to Reason without Human Rationales
By: Chaoxu Pang, Yixuan Cao, Ping Luo
Potential Business Impact:
Lets computers learn to reason without human-written reasoning examples.
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, and then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals that require no gold rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, in which reasoning follows a fixed, procedural strategy that is consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks stems primarily from its ability to let models internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence that reasoning patterns, rather than the quantity or quality of rationales, are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve SFT+RLVR performance comparable to a set of human-annotated rationales ten times larger. These results suggest that large-scale human rationale annotation can be replaced with LLM-based automatic annotation requiring only limited human supervision over reasoning patterns.
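To make the PARO-style pipeline the abstract describes more concrete, here is a minimal Python sketch of one plausible annotation loop: a human-specified reasoning pattern is placed in the prompt, an LLM drafts a rationale, and candidates are kept only if their final answer passes a verifiable check. The pattern text, `llm_generate`, and `verify_answer` are hypothetical placeholders under stated assumptions, not the authors' actual implementation or prompts.

```python
# Hypothetical sketch of a PARO-style rationale annotation loop.
# `llm_generate` and `verify_answer` are assumed stand-ins supplied by the
# caller, not a real library API.

from typing import Callable, Optional

# A task-specific reasoning pattern for numerical semantic matching,
# written once by a human instead of annotating rationales per instance.
REASONING_PATTERN = """\
Solve the task by following this fixed procedure:
1. Identify the quantities mentioned in the query and the candidate text.
2. Normalize units and scales so the quantities are comparable.
3. Compare the normalized values and state whether they match.
Write out each step, then give the final answer as 'Answer: <label>'."""

def annotate_rationale(
    instance: dict,
    llm_generate: Callable[[str], str],
    verify_answer: Callable[[str, dict], bool],
    max_tries: int = 4,
) -> Optional[str]:
    """Generate a pattern-aligned rationale for one instance, or None."""
    prompt = f"{REASONING_PATTERN}\n\nTask: {instance['input']}"
    for _ in range(max_tries):
        rationale = llm_generate(prompt)
        # Keep the rationale only if its final answer is verifiably correct,
        # mirroring RLVR-style verifiable signals (no gold rationale needed).
        if verify_answer(rationale, instance):
            return rationale
    return None  # skip the instance rather than keep a bad rationale

def build_sft_dataset(instances, llm_generate, verify_answer):
    """Collect (input, rationale) pairs for the SFT stage."""
    dataset = []
    for inst in instances:
        rationale = annotate_rationale(inst, llm_generate, verify_answer)
        if rationale is not None:
            dataset.append({"input": inst["input"], "target": rationale})
    return dataset
```

The design choice this sketch highlights, consistent with the abstract's claim, is that human effort shifts from writing rationales per instance to specifying the shared reasoning pattern once; verification of the final answer then filters the generated rationales automatically.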
Similar Papers
On the Mechanism of Reasoning Pattern Selection in Reinforcement Learning for Language Models
Machine Learning (CS)
Teaches computers to pick the best thinking steps.
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Computation and Language
Teaches computers to think better, not just copy.
RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
CV and Pattern Recognition
Teaches computers to understand videos better.