Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
By: Yue Guan, Changming Yu, Shihan Fang, and more
Speculative decoding accelerates LLM inference by generating and verifying multiple tokens in parallel, but existing systems fall short of optimal latency because dynamic speculation strategies conflict with the static assumptions of compiled runtimes. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static-graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
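The equal-growth constraint is what makes the draft tree compatible with a compiled static graph: if every node at a given depth expands to the same number of children, the tree's node count and attention pattern are fixed at compile time, independent of which tokens the drafter actually proposes. The sketch below illustrates this idea in plain PyTorch; the parameter `branch_per_depth` and both helper functions are illustrative assumptions based on the abstract, not Yggdrasil's actual API.

```python
import torch

def build_tree_layout(branch_per_depth: list[int]) -> list[int]:
    """Enumerate tree nodes level by level; return each node's parent index.

    Node 0 is the root (the last verified token); its parent is -1.
    """
    parents = [-1]
    frontier = [0]          # node ids at the current depth
    next_id = 1
    for width in branch_per_depth:
        new_frontier = []
        for parent in frontier:
            for _ in range(width):   # equal growth: same fan-out at each depth
                parents.append(parent)
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return parents

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Boolean mask where (i, j) is True iff node j is node i or one of its ancestors.

    Because the tree shape is static, this mask can be computed once and
    reused by every verification step of a compiled graph.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:      # walk up to the root, marking visible positions
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: branch_per_depth = [2, 2] always yields 1 + 2 + 4 = 7 nodes and a
# fixed 7x7 mask, regardless of which tokens the drafter proposes.
parents = build_tree_layout([2, 2])
mask = tree_attention_mask(parents)
assert mask.shape == (7, 7)
```

The payoff of this fixed layout is that the verifier's input shapes and attention mask never change between decoding steps, so one compiled graph serves every step; only the token ids filling the tree slots vary.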
Similar Papers
Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput
Distributed, Parallel, and Cluster Computing
Uses speculative decoding to hide communication latency in decentralized LLM inference.
Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Computation and Language
Builds draft trees dynamically while accounting for inference cost.
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
Distributed, Parallel, and Cluster Computing
Adapts speculation strategies dynamically to speed up LLM serving.