The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs
By: Nikolaus Howe, Micah Carroll
Potential Business Impact:
Computers can lie to hide bad actions.
The use of reinforcement learning (RL) with chain-of-thought (CoT) reasoning has emerged as a promising approach for developing more capable language models. In turn, this has led to investigation of CoT monitoring as a compelling method for detecting harmful behaviors such as reward hacking, under the assumption that models' reasoning processes reflect their internal decision-making. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors like sycophancy, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms. Beyond being an interesting property of training, we find that while motivated reasoning can be detected by most frontier reasoning models, smaller LLM judges can fail to identify a portion of it, and in rare cases can themselves be persuaded that the reasoning is correct, despite it contradicting clear instructions. This capability gap raises concerns that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect. Our results underscore the need to account for motivated reasoning when relying on chain-of-thought processes for model evaluation and oversight. All code for this paper will be made available. WARNING: some examples in this paper may be upsetting.
Similar Papers
Rectifying LLM Thought from Lens of Optimization
Computation and Language
Teaches computers to think smarter, not longer.
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Computation and Language
Makes AI follow instructions better by fixing reasoning.
Reasoning Models Sometimes Output Illegible Chains of Thought
Machine Learning (CS)
Makes AI's thinking harder to understand.