Token Level Routing Inference System for Edge Devices

Published: April 10, 2025 | arXiv ID: 2504.07878v1

By: Jianshu She, Wenhao Zheng, Zhengzhong Liu and more

Potential Business Impact:

Makes small AI smart enough for big jobs.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

The computational complexity of large language model (LLM) inference significantly constrains how efficiently these models can be deployed on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types: selective intervention by the large model enables high-quality inference, while the smaller model maintains speed and efficiency. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with the generation of under 7% of tokens offloaded to the large model in the cloud.
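
The collaborative decoding loop described in the abstract can be illustrated with a minimal sketch. The abstract does not say how critical tokens are identified, so the routing rule below (defer to the large model whenever the small model's top probability falls under a threshold) is an assumption made purely for illustration, not the paper's router. The names `route_decode`, `toy_small`, and `toy_large`, and the toy probability tables, are hypothetical stand-ins for an on-device model and a cloud model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" maps a token prefix to a next-token distribution.
Model = Callable[[List[str]], Dict[str, float]]


def route_decode(small: Model, large: Model, prompt: List[str],
                 max_new_tokens: int, threshold: float,
                 eos: str = "<eos>") -> Tuple[List[str], float]:
    """Greedy decoding that defers to `large` when `small` is unsure.

    Returns the generated tokens and the fraction of decoding steps
    that were routed to the large model.
    """
    tokens = list(prompt)
    generated: List[str] = []
    routed = 0
    steps = 0
    for _ in range(max_new_tokens):
        steps += 1
        probs = small(tokens)
        token, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence < threshold:
            # Assumed routing rule: low confidence marks a "critical" token,
            # so this step is sent to the (cloud) large model instead.
            large_probs = large(tokens)
            token = max(large_probs.items(), key=lambda kv: kv[1])[0]
            routed += 1
        if token == eos:
            break
        tokens.append(token)
        generated.append(token)
    return generated, (routed / steps if steps else 0.0)


def toy_small(prefix: List[str]) -> Dict[str, float]:
    # Toy on-device model: confident everywhere except right after "is",
    # where its top guess is both uncertain and wrong.
    if prefix[-1] == "is":
        return {"green": 0.40, "blue": 0.35, "red": 0.25}
    table = {"the": "sky", "sky": "is", "blue": "<eos>"}
    return {table.get(prefix[-1], "<eos>"): 0.9, "<unk>": 0.1}


def toy_large(prefix: List[str]) -> Dict[str, float]:
    # Toy cloud model: only ever consulted for the uncertain step.
    if prefix[-1] == "is":
        return {"blue": 0.95, "green": 0.05}
    return {"<eos>": 1.0}


if __name__ == "__main__":
    out, frac = route_decode(toy_small, toy_large, ["the"], 8, threshold=0.5)
    print(out, frac)
```

The returned fraction plays the role of the "under 7% of tokens" figure quoted above: it measures how often the large model was consulted. In this toy run, one of four decoding steps is routed; setting `threshold` to 0 disables routing and leaves the small model's wrong guess in place.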

Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

Machine Learning (CS)

Lets smart computers use less power.

6 Jun 2025 1

90%

Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges

Distributed, Parallel, and Cluster Computing

Smart computers work together for faster, private AI.

22 Jul 2025 0

90%

Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding

Systems and Control

Makes AI answer questions much faster.

13 Oct 2025 1

Country of Origin

🇦🇪 United Arab Emirates

Page Count

8 pages
