Score: 3

The Challenge of Teaching Reasoning to LLMs Without RL or Distillation

Published: July 14, 2025 | arXiv ID: 2507.09850v3

By: Wei Du, Branislav Kisacanin, George Armstrong and more

BigTech Affiliations: NVIDIA

Potential Business Impact:

Teaches computers to think step-by-step to solve problems.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model QwQ-32B-Preview, we lightly fine-tune the base model Qwen2.5-32B. The resulting model outperforms the much larger Qwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.
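To make the small-data setup concrete, here is a minimal sketch of how a handful of long CoT examples might be turned into supervised fine-tuning records. Everything in it is an assumption for illustration: the `CoTExample` type, the prompt template, the `select_longest` heuristic (picking by trace length), and the record keys are hypothetical and are not taken from the paper, which does not describe its data format or selection procedure in this abstract.

```python
from dataclasses import dataclass


@dataclass
class CoTExample:
    """One training example (hypothetical schema, not the paper's)."""
    problem: str
    reasoning: str  # long chain-of-thought trace
    answer: str


def to_sft_record(ex: CoTExample) -> dict:
    """Format one example as a prompt/completion pair for supervised fine-tuning."""
    prompt = f"Problem: {ex.problem}\nThink step by step.\n"
    completion = f"{ex.reasoning}\nFinal answer: {ex.answer}"
    return {"prompt": prompt, "completion": completion}


def select_longest(examples: list[CoTExample], k: int = 20) -> list[CoTExample]:
    """Keep the k examples with the longest traces, counted in whitespace tokens.

    Length is only one of the properties the abstract mentions; using it as
    the sole selection criterion is an assumption made for this sketch.
    """
    ranked = sorted(examples, key=lambda e: len(e.reasoning.split()), reverse=True)
    return ranked[:k]


# Toy pool of 50 examples with increasing trace length.
pool = [CoTExample(f"p{i}", "step " * (i + 1), str(i)) for i in range(50)]
records = [to_sft_record(ex) for ex in select_longest(pool, k=20)]
```

The resulting `records` list could then be passed to whatever fine-tuning pipeline is in use; the choice of 20 mirrors the number of examples reported in the abstract.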

Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation

Artificial Intelligence

Teaches computers to think better, step-by-step.

20 Mar 2025 2

92%

Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance

Computation and Language

Makes AI think better even with short questions.

13 Apr 2025 1

92%

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math

Computation and Language

Makes small computers think like big ones.

30 Apr 2025 0


Country of Origin

🇺🇸 United States

Repos / Data Links

github.com huggingface.co

Page Count

16 pages
