Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
By: Felipe Maia Polo, Xinhe Wang, Mikhail Yurochkin, and more
Potential Business Impact:
Makes AI judges agree better with people.
Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
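The abstract's core idea, a latent human preference score per prompt-response pair with judge-specific deviations that are linear in covariates, can be illustrated with a small toy sketch. The code below is a hedged, minimal version of that absolute-scoring setup: the variable names (theta, beta, X), the simulated covariates, the human "anchor" ratings, and the joint least-squares fit are illustrative assumptions, not the paper's Bridge estimator or its fitting algorithm.

```python
# Minimal sketch (NOT the paper's implementation) of the idea described in the
# abstract: each prompt-response pair i has a latent human preference score
# theta_i, and LLM judge j's rating deviates from it by a linear function of
# covariates x_i. All names and the joint least-squares fit are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_items, n_judges, n_cov = 200, 3, 2                 # pairs, LLM judges, covariates
theta_true = rng.normal(0.0, 1.0, n_items)           # latent human preference scores
X = rng.normal(0.0, 1.0, (n_items, n_cov))           # covariates capturing discrepancy sources
beta_true = rng.normal(0.0, 0.5, (n_judges, n_cov))  # per-judge deviation coefficients

# LLM ratings: latent score + judge-specific linear deviation + noise
Y = theta_true[:, None] + X @ beta_true.T + rng.normal(0.0, 0.3, (n_items, n_judges))

# A subset of pairs also has (noisy) human ratings, anchoring the latent scale
human_idx = rng.choice(n_items, size=50, replace=False)
h = theta_true[human_idx] + rng.normal(0.0, 0.3, human_idx.size)

# Stack all ratings into one linear system in the unknowns (theta, beta)
n_params = n_items + n_judges * n_cov
rows, targets = [], []
for j in range(n_judges):
    for i in range(n_items):
        r = np.zeros(n_params)
        r[i] = 1.0                                                 # theta_i
        r[n_items + j * n_cov: n_items + (j + 1) * n_cov] = X[i]   # beta_j
        rows.append(r)
        targets.append(Y[i, j])
for i in human_idx:                                                # human anchors: h_i ~ theta_i
    r = np.zeros(n_params)
    r[i] = 1.0
    rows.append(r)
    targets.append(h[i])

A, b = np.vstack(rows), np.asarray(targets)
est, *_ = np.linalg.lstsq(A, b, rcond=None)
theta_hat = est[:n_items]
beta_hat = est[n_items:].reshape(n_judges, n_cov)

print("corr(theta_hat, theta_true) on anchored items:",
      np.corrcoef(theta_hat[human_idx], theta_true[human_idx])[0, 1].round(3))
print("recovered judge deviation coefficients:\n", beta_hat.round(2))
```

In this toy setup, the rows of beta_hat show how each simulated judge's ratings drift with the covariates, which mirrors the abstract's stated goal of refining LLM ratings and characterizing systematic human-LLM discrepancies; the paper's actual method additionally covers pairwise comparisons and provides asymptotic guarantees for inference.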
Similar Papers
Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
Artificial Intelligence
Makes AI judges fairer and more trustworthy.
How to Correctly Report LLM-as-a-Judge Evaluations
Machine Learning (CS)
Fixes computer judge mistakes for fairer tests.
From Code to Courtroom: LLMs as the New Software Judges
Software Engineering
Lets computers check the quality of other computers' code.