Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too
By: Logan Lawrence, Ashton Williamson, Alexander Shelton
Potential Business Impact:
Lets computers give absolute grades to machine-written text.
As large language models have been increasingly used as automatic raters for evaluating free-form content, including document summarization, dialog, and story generation, work has been dedicated to evaluating such models by measuring their correlations with human judgment. For sample-level performance, methods that rely on pairwise comparisons between machine-generated texts perform well but often lack the ability to assign absolute scores to individual summaries, an ability crucial for use cases that require thresholding. In this work, we propose a direct-scoring method that uses synthetic summaries to act as pairwise machine rankings at test time. We show that our method performs comparably to state-of-the-art pairwise evaluators in terms of axis-averaged sample-level correlations on the SummEval (+0.03), TopicalChat (-0.03), and HANNA (+0.05) meta-evaluation benchmarks, and release the synthetic in-context summaries as data to facilitate future work.
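To make the setup concrete, here is a minimal sketch (not the authors' released code) of how a candidate summary could receive an absolute score by counting pairwise wins against a pool of synthetic anchor summaries, and how that scorer could then be meta-evaluated with axis-averaged sample-level Kendall correlations against human ratings. The helper `judge_prefers_candidate` is a hypothetical stand-in for an LLM pairwise judge; the benchmark loading and axis names are assumptions for illustration.

```python
# Minimal sketch, assuming a pluggable LLM pairwise judge and per-document
# human ratings as in SummEval-style meta-evaluation benchmarks.
from typing import Callable, Sequence
from scipy.stats import kendalltau


def score_with_anchors(
    source: str,
    candidate: str,
    synthetic_anchors: Sequence[str],
    judge_prefers_candidate: Callable[[str, str, str], bool],
) -> float:
    """Absolute score in [0, 1]: fraction of synthetic anchors the candidate beats.

    judge_prefers_candidate(source, candidate, anchor) should return True if the
    judge prefers the candidate summary over the anchor summary for this source.
    """
    wins = sum(
        judge_prefers_candidate(source, candidate, anchor)
        for anchor in synthetic_anchors
    )
    return wins / len(synthetic_anchors)


def sample_level_kendall(
    metric_scores_per_doc: Sequence[Sequence[float]],
    human_scores_per_doc: Sequence[Sequence[float]],
) -> float:
    """Sample-level correlation: one Kendall tau per source document
    (across its candidate summaries), averaged over documents."""
    taus = []
    for metric_scores, human_scores in zip(metric_scores_per_doc, human_scores_per_doc):
        tau, _ = kendalltau(metric_scores, human_scores)
        taus.append(tau)
    return sum(taus) / len(taus)


if __name__ == "__main__":
    # Toy example: two documents, three candidate summaries each.
    metric = [[0.2, 0.6, 0.9], [0.1, 0.8, 0.5]]
    human = [[1.0, 3.0, 4.0], [2.0, 5.0, 3.0]]
    print(f"axis sample-level Kendall tau: {sample_level_kendall(metric, human):.3f}")
```

In this framing, the synthetic anchors give the direct scorer the same comparative signal that pairwise evaluators exploit, while the win fraction yields an absolute score that can be thresholded.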
Similar Papers
ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Computation and Language
Checks writing quality better than other tools.
LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
Computation and Language
Helps computers grade essays more like humans.
An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models
Computation and Language
Finds best AI for summarizing text.