SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
By: Junming Zhang, Qinyan Zhang, Huajun Sun, and more
Potential Business Impact:
Makes AI language models run much faster on small devices.
Edge acceleration for large language models is crucial for their widespread application; however, achieving fast attention inference and efficient decoding on resource-constrained edge accelerators remains challenging. This paper presents SwiftKV Attention, a per-token pipelined, low-latency, single-pass attention inference algorithm in which every (k_t, v_t) in the KV cache is processed exactly once in a uniform per-token pipeline, with no score materialization, blockwise softmax, or second pass. This enables fast execution on edge accelerators with a single hardware set and no resource-intensive parallelism. Furthermore, to address the limited support for multi-head LLM decoding in existing accelerators, we design the SwiftKV-MHA accelerator, which performs high-precision attention and low-precision GEMV on the same processor array, achieving fast and efficient multi-head parallel decoding. Experimental results show that, on the edge accelerator, the SwiftKV Attention algorithm achieves a 7.16× speedup over native attention and significantly outperforms other attention algorithms. SwiftKV-MHA further reduces attention latency by 13.48×; under the same settings, it improves generation speed by 17.4% and increases token efficiency by 1.98× compared with state-of-the-art works.
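To make the single-pass idea concrete, the sketch below implements attention with a running (online) softmax, where each cached (k_t, v_t) pair is consumed exactly once, no score vector is materialized, and no second pass over the cache is needed. This is a minimal illustration based only on the abstract's description; the function name and structure are assumptions, not the paper's implementation.

import numpy as np

def single_pass_attention(q, kv_cache):
    """Sketch of single-pass attention with an online softmax.
    q: query vector of shape (d,); kv_cache: iterable of (k_t, v_t)
    pairs, each of shape (d,). Illustrative names, not the paper's API."""
    d = q.shape[-1]
    m = -np.inf            # running max of scaled scores
    denom = 0.0            # running softmax denominator
    acc = np.zeros(d)      # running weighted sum of values
    for k_t, v_t in kv_cache:              # one uniform per-token step
        s = float(q @ k_t) / np.sqrt(d)    # scaled dot-product score
        m_new = max(m, s)
        alpha = np.exp(m - m_new)          # rescale previously accumulated state
        p = np.exp(s - m_new)              # weight of the current token
        denom = denom * alpha + p
        acc = acc * alpha + p * v_t
        m = m_new
    return acc / denom     # equals softmax(q K^T / sqrt(d)) @ V

Because the running maximum rescales the partial sums at every step, the loop is numerically equivalent to the standard softmax(qK^T/sqrt(d))V while keeping only O(d) state per head, which is what allows a uniform per-token pipeline on a single hardware set.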
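The mixed-precision claim can be illustrated in the same spirit: attention runs at full precision (as above), while the weight-matrix GEMVs of decoding use low-precision integer arithmetic. The row-wise int8 scheme below is an assumed, generic quantization sketch for illustration only; the abstract does not specify the accelerator's actual number format, and a hardware design would also quantize the activations.

import numpy as np

def quantize_rowwise_int8(W, eps=1e-8):
    """Symmetric row-wise int8 quantization: one float scale per
    output row. Scheme and names are illustrative assumptions."""
    scales = np.maximum(np.abs(W).max(axis=1) / 127.0, eps)
    W_q = np.round(W / scales[:, None]).astype(np.int8)
    return W_q, scales

def int8_gemv(W_q, scales, x):
    """Low-precision GEMV: integer weights with one float rescale
    per output element. The activation x stays in float here for
    simplicity; a real accelerator would quantize it as well."""
    return (W_q.astype(np.int32) @ x) * scales

Running both kernels on one processor array, as SwiftKV-MHA does, lets the precision-sensitive attention path and the throughput-bound projection GEMVs share hardware rather than requiring separate compute units.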
Similar Papers
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Computation and Language
Lets computers remember much longer stories.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
Artificial Intelligence
Makes AI remember more without slowing down.
PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
Multimedia
Makes AI understand videos much faster.