ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
By: Xiao Wang, Daniil Larionov, Siwei Wu, and more
Potential Business Impact:
Checks the quality of computer-generated writing more accurately and cheaply than existing tools.
Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
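The abstract does not give the scoring formula, but the core idea of contrasting a stronger model's judgment against a weaker one's can be sketched roughly as below. Everything in this sketch is an illustrative assumption rather than the paper's actual method: the combination rule (subtracting a weighted weaker-model score, in the spirit of contrastive decoding), the beta weight, the prompt template, the helper names avg_logprob and contrast_score, and the specific Qwen checkpoints. It also assumes both models share one tokenizer, as Qwen models within a family do.

```python
# Hypothetical sketch of a contrastive, source-based evaluation score.
# Not the paper's implementation; formula, prompt, and parameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def avg_logprob(model, tokenizer, source: str, hypothesis: str) -> float:
    """Average log-probability of the hypothesis tokens, conditioned on the source."""
    prompt = f"Source: {source}\nTranslation: "  # illustrative prompt format
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                      # [1, T, vocab]
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # predict tokens 1..T-1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the hypothesis tokens (approximate split at the prompt boundary).
    hyp_lp = token_lp[:, prompt_ids.shape[1] - 1:]
    return hyp_lp.mean().item()


def contrast_score(source, hypothesis, large, small, tokenizer, beta=0.5):
    """Assumed contrastive combination: stronger model minus a weighted weaker model."""
    lp_large = avg_logprob(large, tokenizer, source, hypothesis)
    lp_small = avg_logprob(small, tokenizer, source, hypothesis)
    return lp_large - beta * lp_small


# Usage (model choices are illustrative, mirroring the 3B/0.5B pairing in the abstract):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
# large = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")
# small = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# print(contrast_score(src_text, hyp_text, large, small, tok))
```

One design point this sketch tries to reflect: because the weaker model's score is subtracted, generic preferences shared by both models (for example, favoring longer or higher-likelihood outputs) tend to cancel, which is consistent with the abstract's claim that the method mitigates length and likelihood biases.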
Similar Papers
Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too
Computation and Language
Lets computers give a grade to writing.
SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
Computation and Language
Tests AI to see if it's reliable.
Summarization Metrics for Spanish and Basque: Do Automatic Scores and LLM-Judges Correlate with Humans?
Computation and Language
Tests how well computers summarize Spanish and Basque text.