Thought calibration: Efficient and confident test-time scaling
By: Menghua Wu, Cai Zhou, Stephen Bates, and more
Potential Business Impact:
Lets AI think less, save energy, and still be smart.
Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at significant compute cost. Directly limiting the test-time budget hurts overall performance, yet not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and the overall consistency of the response. Across three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to a 20% reduction on out-of-distribution data.
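The abstract describes the mechanism at a high level: a lightweight probe reads the model's hidden representations and signals when further thinking has stopped producing novel reasoning. The sketch below illustrates that decision loop only in outline; the linear probe architecture, the `StopProbe` and `tau` names, the per-segment granularity, and the synthetic hidden states are all illustrative assumptions, not the authors' actual probes or calibration procedure.

```python
import torch
import torch.nn as nn

class StopProbe(nn.Module):
    """Hypothetical lightweight probe over a reasoning model's hidden states.

    The paper's probes predict when novel reasoning has plateaued; here we
    assume a single linear layer for illustration.
    """
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (hidden_dim,) representation of the latest thought segment.
        # Output: estimated probability that continued thinking still adds
        # novel reasoning (i.e., the reasoning tree is still growing).
        return torch.sigmoid(self.linear(hidden))

def think_with_calibrated_stopping(segments, probe, tau: float = 0.1):
    """Consume thought segments one at a time and stop once the probe's
    score drops below a threshold tau, which would be calibrated on
    held-out data so early stopping rarely changes the final answer."""
    kept = []
    for hidden in segments:
        kept.append(hidden)
        if probe(hidden).item() < tau:
            break  # novel reasoning has plateaued; terminate thinking
    return kept

# Toy usage with random vectors standing in for real hidden states.
if __name__ == "__main__":
    torch.manual_seed(0)
    probe = StopProbe(hidden_dim=64)
    fake_segments = [torch.randn(64) for _ in range(20)]
    kept = think_with_calibrated_stopping(fake_segments, probe, tau=0.4)
    print(f"Stopped after {len(kept)} of {len(fake_segments)} thought segments")
```

In this reading, the compute savings reported above come from truncating the thought stream at the probe's stopping point rather than generating to the full budget; the calibration step is what keeps the truncated answers consistent with full-length reasoning.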
Similar Papers
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
Computation and Language
Stops smart computers from wasting time thinking.
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Computation and Language
Makes AI smarter by teaching it when to think less.
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Computation and Language
Makes AI know when it's unsure of answers.