Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks
By: Rui Bao, Nan Xue, Yaping Sun, and more
Potential Business Impact:
Makes smart assistants answer faster and better.
The integration of wireless communications and Large Language Models (LLMs) is poised to unlock ubiquitous intelligent services, yet deploying them in wireless edge-device collaborative environments presents a critical trade-off between inference quality and end-to-end latency. A fundamental mismatch exists between task complexity and resource allocation: offloading simple queries invites prohibitive latency, while on-device models lack the capacity for demanding computations. To address this challenge, we propose a dynamic, quality-latency aware routing framework that orchestrates inference between a lightweight model on the mobile device and a powerful model on the edge server. Our framework employs two distinct cost models: for single-turn queries, it fuses a BERT-predicted semantic score with communication and computation overheads; for multi-turn dialogues, it further quantifies context-aware costs arising from model switching and KV-cache management. Extensive experiments on the MMLU, GSM8K, and MT-Bench-101 benchmarks demonstrate that, while maintaining full inference quality, our framework cuts average response latency by 5-15% and reduces large-model invocations by 10-20% compared with competitive baselines.
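The routing rule described in the abstract can be pictured as a per-query cost comparison. The sketch below is a minimal, illustrative Python rendering of that idea, not the paper's actual cost models: the `LinkProfile` fields, `predicted_difficulty` stub, weights, and the per-token switching penalty are all assumed placeholders standing in for the BERT-based semantic score, the communication/computation overheads, and the KV-cache switching cost the authors describe.

```python
# Illustrative sketch of quality-latency aware routing between an on-device
# small model and an edge large model. All names, weights, and latency
# figures are assumptions for demonstration, not the paper's formulation.

from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkProfile:
    uplink_mbps: float          # wireless uplink throughput to the edge server
    edge_tokens_per_s: float    # decode throughput of the edge (large) model
    device_tokens_per_s: float  # decode throughput of the on-device (small) model


def predicted_difficulty(query: str) -> float:
    """Stand-in for the BERT-predicted semantic score in [0, 1].
    The paper uses a fine-tuned predictor; this stub just scales with length."""
    return min(1.0, len(query.split()) / 100.0)


def turn_costs(query: str, link: LinkProfile,
               out_tokens: int = 128, quality_weight: float = 2.0):
    """Return (device_cost, edge_cost) for a single turn, in seconds-equivalent units."""
    difficulty = predicted_difficulty(query)

    # Communication overhead: only the edge path uploads the prompt.
    comm_s = len(query.encode()) * 8 / (link.uplink_mbps * 1e6)

    # Computation overhead: expected decode time on each side.
    device_compute_s = out_tokens / link.device_tokens_per_s
    edge_compute_s = out_tokens / link.edge_tokens_per_s

    # Quality penalty: the small model is assumed to degrade on hard queries.
    device_cost = device_compute_s + quality_weight * difficulty
    edge_cost = comm_s + edge_compute_s
    return device_cost, edge_cost


def route(query: str, link: LinkProfile, current_side: Optional[str] = None,
          context_tokens: int = 0, switch_s_per_token: float = 0.002) -> str:
    """Pick 'device' or 'edge'. For multi-turn dialogue (current_side set),
    the side that does not hold the conversation's KV cache pays a
    context-switching cost to re-prefill or transfer the history."""
    device_cost, edge_cost = turn_costs(query, link)
    if current_side == "edge":
        device_cost += context_tokens * switch_s_per_token
    elif current_side == "device":
        edge_cost += context_tokens * switch_s_per_token
    return "device" if device_cost <= edge_cost else "edge"


if __name__ == "__main__":
    link = LinkProfile(uplink_mbps=20.0, edge_tokens_per_s=80.0,
                       device_tokens_per_s=15.0)
    print(route("What is 2 + 2?", link))                      # easy query, likely 'device'
    print(route("Prove the Cauchy-Schwarz inequality.", link,
                current_side="device", context_tokens=1500))  # harder, but switching is penalized
```

Under these assumptions, easy single-turn queries stay on the device because the quality penalty is small and no uplink cost is paid, while hard queries are offloaded; in multi-turn dialogue the switching term biases the router toward keeping the conversation where its KV cache already lives, which is the context-aware effect the abstract attributes to the second cost model.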
Similar Papers
Quality-of-Service Aware LLM Routing for Edge Computing with Multiple Experts
Networking and Internet Architecture
Speeds up AI responses while keeping data private.
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques
Machine Learning (CS)
Lets smart computers use less power.
Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
Systems and Control
Makes AI answer questions much faster.