
FlexSpec: Frozen Drafts Meet Evolving Targets in Edge-Cloud Collaborative LLM Speculative Decoding

Published: January 2, 2026 | arXiv ID: 2601.00644v1

By: Yuchen Li, Rui Kong, Zhonghao Lyu and more

Potential Business Impact:

Lets phones run smart AI without slow internet.

Business Areas:

Cloud Computing Internet Services, Software

Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with speculative decoding (SD) can reduce end-to-end latency by executing a lightweight draft model at the edge and verifying its output with a cloud-side target model, existing frameworks fundamentally rely on tight coupling between the two models. Consequently, every target-model update forces the draft model to be resynchronized, and this repeated synchronization introduces excessive communication overhead, which increases end-to-end latency and ultimately limits the scalability of SD in edge environments. To address these limitations, we propose FlexSpec, a communication-efficient collaborative inference framework tailored for evolving edge-cloud systems. The core design of FlexSpec is a shared-backbone architecture that allows a single, static edge-side draft model to remain compatible with a large family of evolving cloud-side target models. By decoupling edge deployment from cloud-side model updates, FlexSpec eliminates the need for edge-side retraining or repeated model downloads, substantially reducing communication and maintenance costs. Furthermore, to accommodate time-varying wireless conditions and heterogeneous device constraints, we develop a channel-aware adaptive speculation mechanism that dynamically adjusts the speculative draft length based on real-time channel state information and device energy budgets. Extensive experiments demonstrate that FlexSpec outperforms conventional SD approaches in inference efficiency.
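
To make the idea of channel-aware adaptive speculation concrete, here is a minimal sketch of how a draft length could be picked from link conditions and an energy budget. It is not FlexSpec's actual mechanism, which the abstract does not specify. It assumes the standard speculative-decoding estimate of expected tokens per round under a constant per-token acceptance rate alpha, and every name, cost model, and default value below (LinkState, choose_draft_length, the timing and energy constants) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Hypothetical snapshot of the wireless link (not from the paper)."""
    uplink_bps: float  # measured uplink rate in bits per second
    rtt_s: float       # round-trip time to the cloud in seconds


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per round for draft length k, assuming each
    drafted token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def round_latency(k: int, link: LinkState, t_draft_s: float,
                  t_verify_s: float, bits_per_token: int) -> float:
    """Assumed cost model: draft k tokens, upload them, wait one RTT,
    then one verification pass in the cloud."""
    tx_s = k * bits_per_token / link.uplink_bps
    return k * t_draft_s + tx_s + link.rtt_s + t_verify_s


def choose_draft_length(alpha: float, link: LinkState, energy_budget_j: float,
                        *, t_draft_s: float = 0.02, t_verify_s: float = 0.05,
                        bits_per_token: int = 32, e_draft_j: float = 0.05,
                        e_tx_j_per_bit: float = 1e-7, k_max: int = 16) -> int:
    """Pick the draft length that maximizes expected tokens per second
    while keeping per-round device energy within the budget.
    Falls back to k = 1 if no candidate fits the budget."""
    best_k, best_rate = 1, 0.0
    for k in range(1, k_max + 1):
        energy_j = k * e_draft_j + k * bits_per_token * e_tx_j_per_bit
        if energy_j > energy_budget_j:
            break  # energy grows with k, so longer drafts cannot fit either
        rate = expected_tokens(alpha, k) / round_latency(
            k, link, t_draft_s, t_verify_s, bits_per_token)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k


if __name__ == "__main__":
    slow_link = LinkState(uplink_bps=1e6, rtt_s=1.0)
    fast_link = LinkState(uplink_bps=1e6, rtt_s=0.0)
    print(choose_draft_length(0.9, slow_link, energy_budget_j=10.0))
    print(choose_draft_length(0.3, fast_link, energy_budget_j=10.0))
```

Under this toy model, a long round trip favors longer drafts because the fixed network cost is amortized over more tokens, while a low acceptance rate or a tight energy budget pushes the draft length down.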

EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

Machine Learning (CS)

Makes AI write faster without losing quality.

4 Feb 2025 1

90%

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Distributed, Parallel, and Cluster Computing

Makes AI talk faster when shared.

13 Nov 2025 0

89%

DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Machine Learning (CS)

Makes AI talk faster on many devices.

26 Nov 2025 0


Country of Origin

🇨🇦 🇨🇳 🇸🇪 China, Canada, Sweden

Page Count

12 pages
