SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
By: Jinwoo Park, Seunggeun Cho, Dongsu Han
Potential Business Impact:
Makes AI run faster and cheaper everywhere.
Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted inference framework that splits LLM workloads between edge and server GPUs using a speculative decoding scheme, exchanging only token outputs over the network. SpecEdge employs proactive edge drafting to overlap edge token creation with server verification, and pipeline-aware scheduling that interleaves multiple user requests to increase server-side throughput. Experiments show SpecEdge improves overall cost efficiency by 1.91x by raising server throughput 2.22x, and reduces inter-token latency by 11.24% compared to a server-only baseline, introducing a scalable, cost-effective paradigm for LLM serving.
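To make the edge/server split concrete, below is a minimal Python sketch of the general speculative-decoding pattern the abstract describes: an edge-side draft model proposes a block of tokens, a server-side target model verifies them, and only token ids cross the network. The toy models, the draft length `gamma`, and the greedy acceptance rule are illustrative assumptions, not SpecEdge's actual implementation (which additionally overlaps drafting with verification and schedules multiple requests on the server).

```python
# Illustrative sketch of edge-assisted speculative decoding.
# Assumptions: toy "models", gamma=4, greedy match-or-correct acceptance.
import random

VOCAB = list(range(100))  # toy vocabulary of token ids

def edge_draft_next(context):
    """Cheap edge-side draft model: proposes the next token id."""
    random.seed(hash(tuple(context)) % (2**32))
    return random.choice(VOCAB)

def server_verify_next(context):
    """Expensive server-side target model: the token the server would emit."""
    random.seed((hash(tuple(context)) + 1) % (2**32))
    return random.choice(VOCAB)

def speculative_step(context, gamma=4):
    """One round: the edge drafts gamma tokens, the server verifies them.

    Only token ids are exchanged (draft block edge->server, accepted or
    corrected tokens server->edge), never activations or KV caches.
    """
    # Edge GPU: autoregressively draft gamma candidate tokens.
    draft, ctx = [], list(context)
    for _ in range(gamma):
        t = edge_draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # Server GPU: verify the draft block; accept the longest matching prefix,
    # then emit one corrected (or bonus) token of its own.
    accepted, ctx = [], list(context)
    for t in draft:
        target = server_verify_next(ctx)
        if t == target:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target)  # server correction ends the round
            break
    else:
        accepted.append(server_verify_next(ctx))  # all drafts accepted
    return accepted

if __name__ == "__main__":
    context = [1, 2, 3]
    for _ in range(3):
        new_tokens = speculative_step(context)
        context.extend(new_tokens)
        print("accepted this round:", new_tokens)
```

Each accepted draft token saves a full server-side decoding step, which is why higher acceptance rates translate directly into higher server throughput and lower inter-token latency.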
Similar Papers
SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding
Computation and Language
Makes AI answer questions much faster.
Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
Systems and Control
Makes AI answer questions much faster.
SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Distributed, Parallel, and Cluster Computing
Lets small computers run big AI models faster.