Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation
By: Wei-Hsiang Lin, Sheng-Lun Wei, Hen-Hsen Huang, and more
Potential Business Impact:
An AI judges other answers better when it uses its own answer as a reference.
LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models' generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model's own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.
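To make the "do before you judge" idea concrete, here is a minimal sketch of a self-reference-guided evaluation loop: the judge model first generates its own answer to the task, then scores the candidate response with that answer supplied as a reference. The function names, prompt wording, and the `ask_model` helper are illustrative assumptions, not the paper's actual prompts or implementation.

```python
# A minimal sketch of self-reference-guided evaluation (illustrative only).
# `ask_model` stands in for whatever chat-completion call you actually use;
# its name and the prompt wording are assumptions, not the paper's prompts.

from typing import Callable

def self_reference_judge(
    ask_model: Callable[[str], str],  # wraps a single LLM call: prompt -> text
    question: str,
    candidate_response: str,
) -> str:
    """Judge a candidate response using the judge model's own answer as a reference."""
    # Step 1 ("do"): the judge model first answers the question itself.
    own_answer = ask_model(
        "Answer the following question as accurately as you can.\n\n"
        f"Question: {question}"
    )

    # Step 2 ("judge"): the model's own answer is supplied as a reference
    # when it scores the candidate response.
    verdict = ask_model(
        "You are evaluating a response to a question.\n"
        f"Question: {question}\n"
        f"Reference answer (your own earlier answer): {own_answer}\n"
        f"Candidate response: {candidate_response}\n"
        "Using the reference answer as guidance, rate the candidate response "
        "from 1 (poor) to 5 (excellent) and briefly justify the score."
    )
    return verdict


if __name__ == "__main__":
    # Stubbed model call so the sketch runs without any API access.
    def fake_model(prompt: str) -> str:
        return "4 - the candidate matches the reference on the key facts."

    print(self_reference_judge(
        fake_model,
        "In what year did Apollo 11 land on the Moon?",
        "1969",
    ))
```

Generating the reference answer first anchors the judgment to the model's own knowledge rather than to the candidate response alone, which is the mechanism the abstract credits with strengthening the otherwise weak correlation between generation and judgment abilities.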
Similar Papers
Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation
Computation and Language
AI judges ignore correct answers when they disagree.
Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting
Computation and Language
AI grades homework like a teacher.
From Code to Courtroom: LLMs as the New Software Judges
Software Engineering
Lets computers judge the quality of other computers' code.