When to Reason: Semantic Router for vLLM
By: Chen Wang, Xunzhuo Liu, Yuhan Liu, and more
Potential Business Impact:
Smartly uses AI power, saving time and money.
Large Language Models (LLMs) demonstrate substantial accuracy gains when augmented with reasoning modes such as chain-of-thought and inference-time scaling. However, reasoning also incurs significant inference latency and token costs, with environmental and financial impacts, and is unnecessary for many simple prompts. We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial. Our approach achieves a 10.2 percentage point improvement in accuracy on the MMLU-Pro benchmark while reducing response latency by 47.1% and token consumption by 48.5% compared to direct inference with vLLM. These results demonstrate that semantic routing offers an effective mechanism for striking a balance between accuracy and efficiency in open-source LLM serving systems.
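The core idea — classify each query's reasoning need, then enable a reasoning mode only when it helps — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`classify_reasoning_need`, `route`) and the keyword heuristic standing in for the learned semantic classifier are all hypothetical.

```python
# Hypothetical sketch of a semantic router for selective reasoning.
# A keyword heuristic stands in for the paper's learned classifier;
# all names here are illustrative, not the actual API.

REASONING_KEYWORDS = ("prove", "derive", "step by step", "why", "explain")

def classify_reasoning_need(query: str) -> bool:
    """Toy stand-in for a semantic classifier that predicts whether
    chain-of-thought reasoning would improve the answer."""
    q = query.lower()
    return any(keyword in q for keyword in REASONING_KEYWORDS)

def route(query: str) -> dict:
    """Build a chat request, enabling the (hypothetical) reasoning
    flag only when the classifier predicts a benefit."""
    needs_reasoning = classify_reasoning_need(query)
    return {
        "messages": [{"role": "user", "content": query}],
        # In a real deployment this might toggle a system prompt or a
        # server-side reasoning mode; the field name is an assumption.
        "reasoning": needs_reasoning,
    }
```

A simple factual query would be routed without reasoning, while a proof-style query would enable it, saving the latency and token cost of reasoning on prompts that do not need it.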
Similar Papers
From Query to Logic: Ontology-Driven Multi-Hop Reasoning in LLMs
Computation and Language
Helps computers answer tricky questions by thinking step-by-step.
Reasoning Models Reason Well, Until They Don't
Artificial Intelligence
Makes smart computers better at solving hard problems.
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques
Machine Learning (CS)
Lets smart computers use less power.