Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
By: Xingjian Zhang, Tianhong Gao, Suliang Jin, and more
Potential Business Impact:
Helps computer judges rate answers more like people do.
Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited on subjective tasks, where human judgments involve subtle reasoning beyond the annotation labels themselves. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework that infers thinking traces from label-only annotations. The framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. The inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods significantly improve LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
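The abstract does not spell out the rejection sampling step, so the following is a minimal sketch of one plausible reading: sample candidate reasoning traces from an LLM and keep only a trace whose predicted label matches the human annotation. The function name generate_trace_and_label and the retry budget are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of rejection sampling for inferring thinking traces
# from label-only annotations. `generate_trace_and_label` stands in for an
# LLM call that returns (candidate reasoning trace, predicted label); it is
# an assumed interface, not code from the paper.
from typing import Callable, Optional, Tuple


def infer_thinking_trace(
    item: str,
    human_label: str,
    generate_trace_and_label: Callable[[str], Tuple[str, str]],
    max_attempts: int = 8,
) -> Optional[str]:
    """Sample candidate traces; accept the first one whose predicted label
    agrees with the human annotation (rejection sampling)."""
    for _ in range(max_attempts):
        trace, predicted_label = generate_trace_and_label(item)
        if predicted_label == human_label:
            return trace  # accepted: reasoning is consistent with the human label
    return None  # all candidates rejected; leave this item without a trace
```

Accepted traces could then be paired with their items and labels to fine-tune an open LLM rater or to distill clearer annotation guidelines, as the abstract describes.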
Similar Papers
Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Artificial Intelligence
Teaches computers to think more like people.
Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness
Artificial Intelligence
Computers that "think" judge better than those that don't.
Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
Computation and Language
Makes computers better at judging translated words.