TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

Published: January 12, 2026 | arXiv ID: 2601.07353v1

By: Tianyu Liu, Qitan Lv, Yuhao Shen and more

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent work has shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to stop early on difficult tokens and extend generation on easy ones. To address this limitation, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a "deep-and-narrow" form for deterministic contexts and a "shallow-and-wide" form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms the state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
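To make the "deep-and-narrow" versus "shallow-and-wide" behavior concrete, here is a minimal Python sketch of budget-driven, layer-by-layer tree growth. It is not TALON's actual algorithm: the abstract does not specify the hybrid expansion strategy, so the sketch swaps in a simple relative-confidence threshold as a stand-in. The names `build_adaptive_tree`, `ratio`, `max_depth`, and the toy draft functions `confident` and `uncertain` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Path = Tuple[str, ...]
# Hypothetical draft-model interface: token prefix -> {next_token: probability}.
DraftFn = Callable[[Path], Dict[str, float]]


def build_adaptive_tree(
    draft: DraftFn, budget: int, ratio: float = 0.3, max_depth: int = 16
) -> List[List[Tuple[Path, float]]]:
    """Grow a draft tree one layer at a time until `budget` nodes are used.

    Assumed rule (not taken from the paper): a candidate is kept when its
    cumulative path probability is at least `ratio` times the best candidate
    in the same layer, and each layer is capped by the remaining budget.
    """
    layers: List[List[Tuple[Path, float]]] = []
    frontier: List[Tuple[Path, float]] = [((), 1.0)]  # root: empty path
    used = 0
    while used < budget and frontier and len(layers) < max_depth:
        # Expand every frontier node into its children.
        candidates = [
            (path + (tok,), p * q)
            for path, p in frontier
            for tok, q in draft(path).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        best = candidates[0][1]
        # Confidence filter, then budget cap.
        keep = [c for c in candidates if c[1] >= ratio * best][: budget - used]
        layers.append(keep)
        used += len(keep)
        frontier = keep
    return layers


def confident(prefix: Path) -> Dict[str, float]:
    """Toy draft model for a near-deterministic context."""
    return {"a": 0.95, "b": 0.05}


def uncertain(prefix: Path) -> Dict[str, float]:
    """Toy draft model for an ambiguous context."""
    return {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}


deep = [len(layer) for layer in build_adaptive_tree(confident, budget=8)]
wide = [len(layer) for layer in build_adaptive_tree(uncertain, budget=8)]
```

With the same budget of 8 nodes, the confident toy model spends one node per layer across eight layers, while the uncertain one spends four nodes on each of two layers. The fixed `ratio` threshold is only one possible way to split a node budget across layers; the paper's own allocation rule may differ.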

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

Computation and Language

Makes AI talk and write much faster.

30 Oct 2025 1

88%

LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models

CV and Pattern Recognition

Makes AI draw pictures much faster.

10 Feb 2025 4

88%

RADAR: Accelerating Large Language Model Inference With RL-Based Dynamic Draft Trees

Artificial Intelligence

Makes AI write faster by guessing better.

16 Dec 2025 1
