DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
By: Daniil Larionov, Sotaro Takeshita, Ran Zhang, and more
Potential Business Impact:
Helps computers judge writing quality better.
Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks. We evaluate eight models spanning state-of-the-art reasoning models (DeepSeek-R1, OpenAI o3), their distilled variants (8B-70B parameters), and equivalent non-reasoning LLMs. Experiments on WMT23 and SummEval benchmarks reveal architecture- and task-dependent benefits: OpenAI o3-mini models show improved performance with increased reasoning on MT, while DeepSeek-R1 generally underperforms compared to its non-reasoning variant, except in summarization consistency evaluation. Correlation analysis demonstrates that reasoning token usage correlates with evaluation quality only in specific models, while almost all models allocate more reasoning tokens when identifying more quality issues. Distillation maintains reasonable performance up to 32B parameter models but degrades substantially at 8B scale. This work provides the first assessment of reasoning LLMs for NLG evaluation and a comparison with non-reasoning models. We share our code to facilitate further research: https://github.com/NL2G/reasoning-eval.
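The abstract describes correlating model-assigned quality scores with human judgments on benchmarks such as WMT23 and SummEval. The sketch below shows how such a meta-evaluation correlation might be computed; the score lists, function name, and choice of Kendall's tau / Spearman's rho are illustrative assumptions, not the paper's exact protocol or data.

```python
# Minimal meta-evaluation sketch (hypothetical data): correlate LLM-assigned
# quality scores with human judgments, as is common when benchmarking
# MT/summarization evaluators. Names and values here are illustrative only.
from scipy.stats import kendalltau, spearmanr

def correlate_scores(llm_scores, human_scores):
    """Return rank correlations between evaluator scores and human judgments."""
    tau, _ = kendalltau(llm_scores, human_scores)
    rho, _ = spearmanr(llm_scores, human_scores)
    return {"kendall_tau": tau, "spearman_rho": rho}

if __name__ == "__main__":
    # Hypothetical segment-level scores for a handful of MT outputs.
    llm_scores = [0.82, 0.40, 0.95, 0.61, 0.33]
    human_scores = [0.78, 0.35, 0.90, 0.70, 0.30]
    print(correlate_scores(llm_scores, human_scores))
```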
Similar Papers
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases
Computation and Language
Tests AI doctors' thinking for better patient care.
AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
Computation and Language
Helps computers understand Arabic text better.
Medical Reasoning in LLMs: An In-Depth Analysis of DeepSeek R1
Computation and Language
Helps doctors diagnose sickness with a smart computer.