SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
By: Yuyang Xu, Yi Cheng, Haochao Ying, and more
Potential Business Impact:
Makes AI think smarter and faster, and fix its own mistakes.
Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically show that incorrect answers partly stem from verbose reasoning processes that lack correct self-correction, allowing errors to accumulate across multiple reasoning steps. To address this, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process-supervision framework that enables fine-grained optimization of each reasoning step. SSPO requires neither auxiliary models nor step-wise manual annotations; instead, it leverages step-wise preference signals generated by the model itself to guide optimization toward reasoning compression. Experiments demonstrate that the reasoning sequences generated by SSPO are both accurate and succinct, effectively mitigating overthinking without compromising model performance across diverse domains and languages.
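The abstract does not spell out the training objective, but to make the idea of "step-wise preference signals without an auxiliary reward model" concrete, here is a minimal sketch assuming a DPO-style preference loss applied independently to each reasoning step. The function name, the `beta` temperature, and the use of a frozen reference model are illustrative assumptions, not the paper's stated formulation.

```python
# Illustrative sketch only: a DPO-style preference loss applied per reasoning step,
# where the preference signal is assumed to come from the model's own step
# log-likelihoods (preferred = concise/correct step, dispreferred = verbose/erroneous step).
# This is NOT the exact SSPO objective; all names and defaults are hypothetical.

import torch
import torch.nn.functional as F


def stepwise_preference_loss(
    chosen_step_logps: torch.Tensor,    # (num_steps,) policy log-prob of preferred steps
    rejected_step_logps: torch.Tensor,  # (num_steps,) policy log-prob of dispreferred steps
    ref_chosen_logps: torch.Tensor,     # (num_steps,) frozen reference log-prob of preferred steps
    ref_rejected_logps: torch.Tensor,   # (num_steps,) frozen reference log-prob of dispreferred steps
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO-style loss computed for each reasoning step, then averaged over steps."""
    # Implicit per-step rewards measured relative to the frozen reference model.
    chosen_rewards = beta * (chosen_step_logps - ref_chosen_logps)
    rejected_rewards = beta * (rejected_step_logps - ref_rejected_logps)
    # Bradley-Terry preference: push each preferred step above its dispreferred counterpart.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    n_steps = 4  # toy 4-step reasoning trace
    loss = stepwise_preference_loss(
        chosen_step_logps=torch.randn(n_steps),
        rejected_step_logps=torch.randn(n_steps),
        ref_chosen_logps=torch.randn(n_steps),
        ref_rejected_logps=torch.randn(n_steps),
    )
    print(f"step-wise preference loss: {loss.item():.4f}")
```

Applying the loss per step rather than over the whole trace is what gives the "fine-grained optimization of each reasoning step" described above; the preference pairs themselves would be traced from the model's own generations rather than scored by an auxiliary model.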
Similar Papers
Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
Artificial Intelligence
Makes smart computers think faster and give shorter answers.
Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning
Computation and Language
Makes AI think smarter and finish faster.
Boosting LLM Reasoning via Spontaneous Self-Correction
Artificial Intelligence
Helps computers solve math problems better.