PiCSAR: Probabilistic Confidence Selection And Ranking
By: Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, and more
Potential Business Impact:
Helps smart computers solve hard problems better.
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer, which naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
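The scoring described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes you already have per-token log-probabilities for the reasoning chain and the final answer of each sampled candidate (e.g. from a model's scoring API), and that the joint log-likelihood is the plain sum of the two parts, as the decomposition suggests. The function names `picsar_score` and `best_of_n` are hypothetical.

```python
def picsar_score(reasoning_logprobs, answer_logprobs):
    """Joint log-likelihood of a candidate, decomposed into
    reasoning confidence + answer confidence (both sums of
    per-token log-probabilities, so both are <= 0)."""
    reasoning_conf = sum(reasoning_logprobs)
    answer_conf = sum(answer_logprobs)
    return reasoning_conf + answer_conf


def best_of_n(candidates):
    """Select the index of the highest-scoring candidate.

    candidates: list of (reasoning_logprobs, answer_logprobs)
    pairs, one per sampled generation."""
    scores = [picsar_score(r, a) for r, a in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: candidate 1 is more confident in both parts, so it wins.
candidates = [
    ([-1.0, -2.0], [-0.5]),   # score = -3.5
    ([-0.5, -0.5], [-0.2]),   # score = -1.2
]
best = best_of_n(candidates)  # -> 1
```

Because the method is training-free, applying it only requires a model that exposes token log-probabilities; no reward model or ground-truth answer is needed at selection time.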
Similar Papers
PACR: Progressively Ascending Confidence Reward for LLM Reasoning
Artificial Intelligence
Helps AI learn faster by rewarding good thinking steps.
ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges
Artificial Intelligence
Checks if AI's thinking is trustworthy.
Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space
Computation and Language
Helps AI know how sure it is about answers.