Score: 1

When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator

Published: April 30, 2025 | arXiv ID: 2505.03786v1

By: Md Fahim Anjum

Potential Business Impact:

Helps computers understand questions better.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLM) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs from reasoning that enables fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that distilled DeepSeek-R1-1.5B achieves up to $87\%$ higher F1 and $3.7\%$ better discrimination accuracy than CodeLlama-7B, as well as $3.7\%$ higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and only providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, offering insights into their optimal role within LLM planning infrastructures.

DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Computation and Language

Helps computers judge writing quality better.

10 Apr 2025 2

91%

Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

Computation and Language

Tests AI doctors' thinking for better patient care.

6 Mar 2025 1

91%

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Computation and Language

Makes small computers think like big ones.

30 Apr 2025 0

View PDF Login to Bookmark

Country of Origin

🇺🇸 United States

Page Count

20 pages

When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator

Helps computers understand questions better.

Technical Abstract

DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math