Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
By: Zailong Tian , Zhuoheng Han , Yanzhe Chen and more
Potential Business Impact:
Makes AI judges more honest about what they know.
Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of well-calibrated confidence, which is vital for adaptive and reliable evaluation pipelines. In this work, we advocate a shift from accuracy-centric evaluation to confidence-driven, risk-aware LLM-as-a-Judge systems, emphasizing the necessity of well-calibrated confidence for trustworthy and adaptive evaluation. We systematically identify the **Overconfidence Phenomenon** in current LLM-as-a-Judges, where predicted confidence significantly overstates actual correctness, undermining reliability in practical deployment. To quantify this phenomenon, we introduce **TH-Score**, a novel metric measuring confidence-accuracy alignment. Furthermore, we propose **LLM-as-a-Fuser**, an ensemble framework that transforms LLMs into reliable, risk-aware evaluators. Extensive experiments demonstrate that our approach substantially improves calibration and enables adaptive, confidence-driven evaluation pipelines, achieving superior reliability and accuracy compared to existing baselines.
Similar Papers
Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Artificial Intelligence
Makes AI judges more honest about what they know.
How to Correctly Report LLM-as-a-Judge Evaluations
Machine Learning (CS)
Fixes computer judge mistakes for fairer tests.
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Artificial Intelligence
Makes AI judges fairer and more trustworthy.