Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks

Published: March 19, 2025 | arXiv ID: 2503.14882v1

By: Kai Zhang, Hengtao He, Shenghui Song and more

Potential Business Impact:

Lets phones run smart AI without the internet.

Business Areas:

Wireless Hardware, Mobile

Large language models (LLMs) have demonstrated remarkable success across various application domains, but their enormous sizes and computational demands pose significant challenges for deployment on resource-constrained edge devices. To address this issue, we propose a novel distributed on-device LLM inference framework that leverages tensor parallelism to partition the neural network tensors (e.g., weight matrices) of one LLM across multiple edge devices for collaborative inference. A key challenge in tensor parallelism is the frequent all-reduce operations that aggregate intermediate layer outputs across participating devices, which incur significant communication overhead. To alleviate this bottleneck, we propose an over-the-air computation (AirComp) approach that harnesses the analog superposition property of wireless multiple-access channels to perform fast all-reduce steps. To utilize the heterogeneous computational capabilities of edge devices and mitigate communication distortions, we investigate a joint model assignment and transceiver optimization problem to minimize the average transmission error. The resulting mixed-timescale stochastic non-convex optimization problem is intractable, so we propose an efficient two-stage algorithm to solve it. Moreover, we prove that the proposed algorithm converges almost surely to a stationary point of the original problem. Comprehensive simulation results show that the proposed framework outperforms existing benchmark schemes, achieving up to 5x inference speed acceleration and improving inference accuracy.
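
To make the tensor-parallel step in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's algorithm. It assumes a row-wise split of one weight matrix across devices with uneven shares, and it models the over-the-air aggregation only as an exact sum plus optional Gaussian noise. The function names (`matvec`, `split_rows`, `all_reduce_sum`), the `shares` assignment, and the noise model are all hypothetical.

```python
import random


def matvec(x, W):
    """Return x @ W for a vector x (length n) and a matrix W (n rows, m columns)."""
    m = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(m)]


def split_rows(x, W, shares):
    """Split x and the rows of W into contiguous shards, one per device.

    shares[k] is the number of rows assigned to device k (a hypothetical,
    hand-picked assignment; the paper optimizes this jointly with the
    transceivers).
    """
    assert sum(shares) == len(x) == len(W)
    shards, start = [], 0
    for n in shares:
        shards.append((x[start:start + n], W[start:start + n]))
        start += n
    return shards


def all_reduce_sum(partials, noise_std=0.0, rng=None):
    """Sum per-device partial outputs element-wise.

    noise_std > 0 adds Gaussian noise to the sum as a crude stand-in for
    over-the-air aggregation distortion; it is not a wireless channel model.
    """
    total = [sum(column) for column in zip(*partials)]
    if noise_std == 0.0:
        return total
    rng = rng or random.Random(0)
    return [t + rng.gauss(0.0, noise_std) for t in total]


# One linear layer: 4 inputs, 2 outputs.
x = [1.0, 2.0, 3.0, 4.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

# Single-device reference result.
full = matvec(x, W)

# Two devices with unequal shares: 1 row and 3 rows.
shards = split_rows(x, W, shares=[1, 3])
partials = [matvec(x_k, W_k) for x_k, W_k in shards]

exact = all_reduce_sum(partials)                  # ideal aggregation
noisy = all_reduce_sum(partials, noise_std=0.1)   # distorted aggregation
error = sum((a - b) ** 2 for a, b in zip(noisy, full)) / len(full)
```

With an ideal sum, `exact` matches the single-device result `full`, which is why the partition itself does not change the layer output. The `error` value is the mean squared deviation introduced by the noisy aggregation, a toy counterpart of the transmission error the abstract aims to minimize.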

Unfolded Deep Graph Learning for Networked Over-the-Air Computation

Signal Processing

Lets many devices share computing power wirelessly.

16 May 2025 0

89%

Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding

Systems and Control

Makes AI answer questions much faster.

13 Oct 2025 1

89%

The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks

Machine Learning (CS)

Makes smart computer programs run faster on phones.

14 May 2025 1

Country of Origin

🇭🇰 Hong Kong

Page Count

16 pages
