xLLM Technical Report
By: Tongxuan Liu, Tao Peng, Peijun Yang, and more
Potential Business Impact:
Makes smart computer programs run much faster.
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework designed for high-performance, large-scale enterprise-grade serving, with deep optimizations for diverse AI accelerators. To address the challenges of serving heterogeneous online and offline workloads at this scale under strict latency constraints, xLLM builds a novel decoupled service-engine architecture. At the service layer, xLLM-Service features an intelligent scheduling module that efficiently processes multimodal requests and co-locates online and offline tasks through unified elastic scheduling to maximize cluster utilization. This module relies on a workload-adaptive dynamic Prefill-Decode (PD) disaggregation policy and a novel Encode-Prefill-Decode (EPD) disaggregation policy designed for multimodal inputs. It also incorporates a distributed architecture that provides global KV Cache management and robust fault tolerance for high availability. At the engine layer, xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources, through comprehensive multi-layer execution pipeline optimizations, an adaptive graph mode, and xTensor memory management. xLLM-Engine further integrates algorithmic enhancements such as optimized speculative decoding and dynamic EPLB (expert-parallel load balancing), which collectively boost throughput and inference efficiency. Extensive evaluations demonstrate that xLLM delivers significantly superior performance and resource efficiency. Under identical time-per-output-token (TPOT) constraints, xLLM achieves throughput up to 1.7x that of MindIE and 2.2x that of vLLM-Ascend with Qwen-series models, while maintaining an average throughput of 1.7x that of MindIE with DeepSeek-series models. The xLLM framework is publicly available at https://github.com/jd-opensource/xllm and https://github.com/jd-opensource/xllm-service.
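To make the workload-adaptive PD disaggregation idea concrete, the sketch below shows one plausible routing decision: short prompts run colocated (disaggregation would mostly add KV-cache transfer cost), while long prompts are split so a heavy prefill does not stall an instance's decode batch. This is a minimal illustration only; the `Instance` and `Request` types, the `route` function, and the token threshold are all hypothetical and are not xLLM's actual API.

```python
# Hypothetical sketch of a workload-adaptive Prefill-Decode (PD)
# disaggregation decision. All names and thresholds here are assumptions
# for illustration; they do not reflect xLLM's real interfaces.

from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    queued_prefill_tokens: int   # tokens waiting in the prefill queue
    active_decode_batch: int     # sequences currently decoding

@dataclass
class Request:
    prompt_tokens: int

def route(req: Request, prefill_pool: list[Instance],
          decode_pool: list[Instance], colocate_threshold: int = 2048):
    """Route a request either colocated or disaggregated.

    Short prompts stay colocated to avoid KV-cache transfer overhead;
    long prompts are disaggregated so prefill does not stall decoding.
    """
    if req.prompt_tokens < colocate_threshold:
        # Colocate: run prefill and decode on the lightest decode instance.
        target = min(decode_pool, key=lambda i: i.active_decode_batch)
        return ("colocated", target, target)
    # Disaggregate: prefill on the least-loaded prefill instance, then
    # hand the KV cache to the least-loaded decode instance.
    p = min(prefill_pool, key=lambda i: i.queued_prefill_tokens)
    d = min(decode_pool, key=lambda i: i.active_decode_batch)
    return ("disaggregated", p, d)

if __name__ == "__main__":
    prefills = [Instance("p0", 4096, 0), Instance("p1", 1024, 0)]
    decodes = [Instance("d0", 0, 12), Instance("d1", 0, 4)]
    # A long prompt is disaggregated across p1 (least queued) and d1.
    print(route(Request(prompt_tokens=8000), prefills, decodes))
```

A production policy would also weigh interconnect bandwidth, KV-cache size, and current TPOT headroom rather than a fixed prompt-length threshold, but the load-aware split above captures the basic trade-off the abstract describes.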
Similar Papers
Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
Distributed, Parallel, and Cluster Computing
Makes smart computer programs run much faster.
An Explorative Study on Distributed Computing Techniques in Training and Inference of Large Language Models
Distributed, Parallel, and Cluster Computing
Lets big AI run on normal computers.
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
Distributed, Parallel, and Cluster Computing
Lets computers solve hard scheduling puzzles from words.