Learning to Reason in LLMs by Expectation Maximization
By: Junghyun Lee, Branislav Kveton, Sunav Choudhary, and more
Potential Business Impact:
Helps computers think step-by-step to solve problems.
Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. Despite its simplicity, we observe that PPS outperforms the other sampling schemes.
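To make the latent-variable view concrete, here is a minimal sketch of the formulation in generic notation, with x the question, z the rationale, and y the answer; the paper's exact symbols and objective may differ:

    p_\theta(y \mid x) \;=\; \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z),

    \log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{z \sim q(\cdot \mid x, y)}\!\big[\log p_\theta(z \mid x) + \log p_\theta(y \mid x, z)\big] \;+\; \mathcal{H}(q).

Under this reading, the E-step amounts to choosing the sampling distribution q over rationales that justify the correct answer (which the abstract's schemes, rejection sampling with a budget, STaR, and PPS, can be seen as instantiating), and the M-step fine-tunes the model parameters \theta on the sampled rationale-answer pairs.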
Similar Papers
Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Artificial Intelligence
Teaches computers to solve math problems correctly.
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Machine Learning (CS)
Makes AI smarter without extra training.
Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs
Computation and Language
Helps computers understand and use probability better.