Communication-Efficient Distributed On-Device LLM Inference Over Wireless Networks
By: Kai Zhang, Hengtao He, Shenghui Song and more
Potential Business Impact:
Lets several phones or other edge devices run one large AI model together over a wireless link, without relying on a cloud server.
Large language models (LLMs) have demonstrated remarkable success across various application domains, but their enormous sizes and computational demands pose significant challenges for deployment on resource-constrained edge devices. To address this issue, we propose a novel distributed on-device LLM inference framework that leverages tensor parallelism to partition the neural network tensors (e.g., weight matrices) of one LLM across multiple edge devices for collaborative inference. A key challenge in tensor parallelism is the frequent all-reduce operations needed to aggregate intermediate layer outputs across participating devices, which incur significant communication overhead. To alleviate this bottleneck, we propose an over-the-air computation (AirComp) approach that harnesses the analog superposition property of wireless multiple-access channels to perform fast all-reduce steps. To utilize the heterogeneous computational capabilities of edge devices and mitigate communication distortions, we investigate a joint model assignment and transceiver optimization problem to minimize the average transmission error. The resulting mixed-timescale stochastic non-convex optimization problem is intractable to solve directly, so we propose an efficient two-stage algorithm for it. Moreover, we prove that the proposed algorithm converges almost surely to a stationary point of the original problem. Comprehensive simulation results show that the proposed framework outperforms existing benchmark schemes, achieving up to a 5x inference speedup and improved inference accuracy.
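The tensor-parallel step and the all-reduce described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (split_sizes, tensor_parallel_matmul), the capability values, and the noise level are hypothetical. The sketch assumes the weight matrix is split by rows in proportion to each device's compute capability, and it models the over-the-air all-reduce as an exact sum plus additive Gaussian noise, a simplified stand-in for the residual distortion that the transceiver optimization aims to minimize.

```python
import numpy as np

def split_sizes(total, capabilities):
    """Assign `total` weight rows to devices in proportion to their
    (hypothetical) compute capabilities; sizes always sum to `total`."""
    caps = np.asarray(capabilities, dtype=float)
    sizes = np.floor(total * caps / caps.sum()).astype(int)
    sizes[: total - sizes.sum()] += 1  # hand out rows lost to rounding
    return sizes

def tensor_parallel_matmul(x, W, sizes, noise_std=0.0, rng=None):
    """Compute x @ W with W split by rows across devices.

    Each device holds W[a:b, :] and the matching slice x[:, a:b], and
    produces a full-size partial output. The all-reduce sums the partials.
    noise_std > 0 adds Gaussian noise to the sum (assumed AirComp model).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    bounds = np.concatenate(([0], np.cumsum(sizes)))
    partials = [x[:, a:b] @ W[a:b, :] for a, b in zip(bounds[:-1], bounds[1:])]
    aggregated = np.sum(partials, axis=0)  # ideal all-reduce (sum)
    return aggregated + noise_std * rng.standard_normal(aggregated.shape)

# Usage: three devices, the third assumed twice as capable as the others.
rng = np.random.default_rng(42)
x = rng.standard_normal((2, 8))   # activations: 2 tokens, hidden size 8
W = rng.standard_normal((8, 4))   # one layer's weight matrix
sizes = split_sizes(W.shape[0], [1.0, 1.0, 2.0])

exact = x @ W
ideal = tensor_parallel_matmul(x, W, sizes)
noisy = tensor_parallel_matmul(x, W, sizes, noise_std=0.1, rng=rng)

print("rows per device:", sizes)
print("ideal all-reduce matches x @ W:", np.allclose(ideal, exact))
print("MSE with noisy aggregation:", np.mean((noisy - exact) ** 2))
```

With noise_std set to zero the partitioned result equals the single-device product, which is why tensor parallelism does not change the model output by itself. Any accuracy loss in this sketch comes only from the aggregation noise, which mirrors why the abstract treats transmission error as the quantity to minimize.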
Similar Papers
Unfolded Deep Graph Learning for Networked Over-the-Air Computation
Signal Processing
Lets many devices share computing power wirelessly.
Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
Systems and Control
Makes AI answer questions much faster.
The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks
Machine Learning (CS)
Makes smart computer programs run faster on phones.