Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
By: Youpeng Zhao, Jinpeng LV, Di Wu and more
Potential Business Impact:
Makes AI answer questions faster and cheaper.
Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics such as latency and cost per token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.
Similar Papers
The Art of Scaling Test-Time Compute for Large Language Models
Computation and Language
Makes AI think better by changing how it works.
Revisiting Test-Time Scaling: A Survey and a Diversity-Aware Method for Efficient Reasoning
Computation and Language
Makes smart computer programs think better, faster.
Investigating Test-Time Scaling with Reranking for Machine Translation
Computation and Language
Makes computer translations better by trying many options.