TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale
By: Dongha Yoon, Younghoon Min, Hoshik Kim, and more
Disaggregated LLM serving improves resource efficiency by separating the compute-intensive prefill phase from the latency-critical decode phase. However, this architecture introduces a fundamental bottleneck: key/value (KV) tensors generated during prefill must be transferred to decode workers, and existing systems rely on RDMA-based network paths for this exchange. As model sizes and context lengths grow, KV transfer comes to dominate both time-to-first-token (TTFT) and peak throughput, and it remains highly sensitive to network contention even when prefix reuse is high. This paper presents TraCT, a rack-scale LLM serving system that uses CXL shared memory as both a KV-transfer substrate and a rack-wide prefix-aware KV cache. TraCT lets GPUs write and read KV blocks directly through CXL load/store and DMA operations, eliminating the NIC hop that constrains existing disaggregated pipelines. Realizing this design, however, requires addressing new challenges in synchronization, consistency, and data management on non-coherent CXL memory; TraCT addresses them with software mechanisms, including a two-tier inter-node synchronization scheme. We implement TraCT on the Dynamo LLM inference framework and show that, across static and synthetic workloads, TraCT reduces average TTFT by up to 9.8x, lowers P99 latency by up to 6.2x, and improves peak throughput by up to 1.6x compared to RDMA and DRAM-based caching baselines.
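To illustrate why explicit software synchronization is needed when KV blocks are handed off over non-coherent CXL shared memory, the minimal C sketch below shows a single-writer, single-reader publish/consume protocol: the prefill side writes a block, flushes it from CPU caches, and publishes a sequence number; the decode side polls the sequence number and re-validates it after copying. The slot layout, block size, flush primitives, and the anonymous-mmap stand-in for the CXL window are illustrative assumptions, not TraCT's actual protocol (which also coordinates GPU DMA and a rack-wide prefix cache).

/* Minimal sketch (assumed, not TraCT's implementation): publishing a KV block
 * through a non-coherent shared-memory window with explicit cache flushes. */
#include <immintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <sys/mman.h>

#define KV_BLOCK_BYTES 4096          /* hypothetical KV block size */
#define CACHELINE      64

typedef struct {
    _Atomic uint64_t seq;            /* odd = write in progress, even = valid */
    uint8_t data[KV_BLOCK_BYTES];
} kv_slot_t;

/* Flush a byte range so writes become visible through the non-coherent window. */
static void flush_range(const void *p, size_t len) {
    const char *c = (const char *)p;
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush(c + off);
    _mm_sfence();                    /* order flushes before the next flag update */
}

/* Prefill side: mark the slot busy, write and flush the block, then publish. */
static void publish_kv_block(kv_slot_t *slot, const uint8_t *kv, uint64_t seq) {
    atomic_store_explicit(&slot->seq, seq | 1, memory_order_release);
    flush_range(&slot->seq, sizeof slot->seq);
    memcpy(slot->data, kv, KV_BLOCK_BYTES);
    flush_range(slot->data, KV_BLOCK_BYTES);
    atomic_store_explicit(&slot->seq, seq + 2, memory_order_release);
    flush_range(&slot->seq, sizeof slot->seq);
}

/* Decode side: accept only a valid (even, newer) block and re-check afterwards. */
static int consume_kv_block(kv_slot_t *slot, uint8_t *out, uint64_t last_seq) {
    uint64_t s = atomic_load_explicit(&slot->seq, memory_order_acquire);
    if ((s & 1) || s <= last_seq)
        return 0;                    /* still being written, or nothing new */
    memcpy(out, slot->data, KV_BLOCK_BYTES);
    /* Re-read the sequence number to detect a concurrent overwrite. */
    return atomic_load_explicit(&slot->seq, memory_order_acquire) == s;
}

int main(void) {
    /* Stand-in for a CXL shared-memory mapping; a real deployment would mmap
     * the CXL device (e.g., a DAX character device) instead of anonymous memory. */
    kv_slot_t *slot = mmap(NULL, sizeof *slot, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    uint8_t kv[KV_BLOCK_BYTES], out[KV_BLOCK_BYTES];
    memset(kv, 0xAB, sizeof kv);
    publish_kv_block(slot, kv, 0);
    printf("consumed: %d\n", consume_kv_block(slot, out, 0));
    return 0;
}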