Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs
By: Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou
Potential Business Impact:
Makes AI show its real thinking steps.
Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), on their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
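The two methods compared in the abstract optimize the policy in different ways: DPO trains directly on preference pairs (chosen vs. rejected responses) against a frozen reference model, while GRPO scores each sampled response relative to the mean and standard deviation of rewards within its group of responses to the same prompt, removing the need for a learned value critic. The Python/PyTorch sketch below illustrates the general form of these two objectives only; the function names, hyperparameters, and toy numbers are illustrative assumptions and not the authors' implementation or reward design.

# Minimal sketch of the general DPO and GRPO objectives, assuming
# per-response log-probabilities and scalar rewards are already available.
# Illustrative only; not the paper's code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # DPO: increase the policy's preference for the chosen response over
    # the rejected one, measured relative to a frozen reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

def grpo_advantages(group_rewards, eps=1e-8):
    # GRPO: normalize each response's reward against the statistics of its
    # own group (responses sampled for the same prompt), so no critic is needed.
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

if __name__ == "__main__":
    # Toy numbers chosen for illustration only.
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                    torch.tensor([-13.0]), torch.tensor([-14.8]))
    adv = grpo_advantages(torch.tensor([0.2, 0.9, 0.4, 0.7]))
    print(f"DPO loss: {loss.item():.4f}")
    print(f"GRPO group-relative advantages: {adv.tolist()}")

In a full GRPO training loop, these normalized advantages would weight a clipped policy-gradient update over the sampled responses; here only the group-relative normalization step is shown.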
Similar Papers
Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
CV and Pattern Recognition
Makes AI draw better pictures by thinking step-by-step.
Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement
Artificial Intelligence
Teaches computers to think better and solve problems.
Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Artificial Intelligence
Makes AI think smarter and avoid mistakes.