HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
By: Peirong Zheng, Wenchao Xu, Haozhao Wang, and more
Potential Business Impact:
Makes AI work faster on phones, even with bad internet.
Deploying large language model (LLM) inference at the edge enables prompt service responsiveness while protecting user privacy, but it is critically challenged by the resource constraints of a single edge node. Distributed inference has emerged to aggregate and leverage computational resources across multiple devices. Yet existing methods typically require strict synchronization, which is often infeasible under unreliable network conditions. In this paper, we propose HALO, a novel framework that boosts distributed LLM inference in lossy edge networks. The core idea is to enable relaxed yet effective synchronization by strategically allocating less critical neuron groups to unstable devices, thus avoiding the excessive waiting time incurred by delayed packets. HALO introduces three key mechanisms: (1) a semantic-aware predictor that assesses the significance of neuron groups prior to activation; (2) a parallel execution scheme that loads neuron groups during model inference; and (3) a load-balancing scheduler that efficiently orchestrates multiple devices with heterogeneous resources. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions, maintains performance comparable to that under ideal conditions, and significantly outperforms the state of the art across various scenarios.
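To make the core idea concrete, here is a minimal Python sketch of significance-based placement: neuron groups predicted to be more important are assigned to more reliable devices, so that delayed packets from lossy links only carry low-significance activations. All names (NeuronGroup, Device, allocate_groups), the reliability metric, and the simple striping heuristic are illustrative assumptions, not HALO's actual predictor or scheduler.

```python
# Hypothetical sketch of HALO's core allocation idea: place the least
# critical neuron groups on the least reliable devices, so packets delayed
# on lossy links only affect low-significance activations.
from dataclasses import dataclass

@dataclass
class NeuronGroup:
    group_id: int
    significance: float  # predicted importance prior to activation (higher = more critical)

@dataclass
class Device:
    device_id: int
    reliability: float   # assumed link-quality estimate, e.g. packet-delivery ratio

def allocate_groups(groups: list[NeuronGroup], devices: list[Device]) -> dict[int, list[int]]:
    """Stripe neuron groups across devices so high-significance groups land on stable links."""
    groups_sorted = sorted(groups, key=lambda g: g.significance, reverse=True)
    devices_sorted = sorted(devices, key=lambda d: d.reliability, reverse=True)
    assignment: dict[int, list[int]] = {d.device_id: [] for d in devices_sorted}
    for i, g in enumerate(groups_sorted):
        # Early (high-significance) groups map to the most reliable devices.
        idx = min(i * len(devices_sorted) // len(groups_sorted), len(devices_sorted) - 1)
        assignment[devices_sorted[idx].device_id].append(g.group_id)
    return assignment

if __name__ == "__main__":
    groups = [NeuronGroup(i, s) for i, s in enumerate([0.9, 0.7, 0.5, 0.3, 0.1, 0.05])]
    devices = [Device(0, 0.99), Device(1, 0.80), Device(2, 0.45)]
    print(allocate_groups(groups, devices))
    # -> groups 0,1 on the most reliable device; groups 4,5 on the lossiest one
```

A real scheduler would also balance load against each device's compute and memory, as the abstract describes; this sketch isolates only the reliability-aware placement step.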
Similar Papers
LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
Distributed, Parallel, and Cluster Computing
Lets big computer brains work on small devices.
HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference
Hardware Architecture
Makes AI chatbots answer questions much faster.
Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding
Systems and Control
Makes AI answer questions much faster.