Beyond Single-GPU: Scaling PDLP to Distributed Multi-GPU Systems

Published: January 12, 2026 | arXiv ID: 2601.07628v1

By: Hongpei Li, Yicheng Huang, Huikang Liu, and more

Potential Business Impact:

Solves very large linear programming problems faster by spreading the work across many GPUs.

Business Areas:
GPU Hardware

In this work, we present a distributed implementation of the Primal-Dual Hybrid Gradient (PDHG) algorithm for solving massive-scale linear programming (LP) problems. Although PDHG-based solvers have shown strong performance on single-node GPU architectures, their applicability to industrial-scale instances is often limited by GPU memory capacity and computational throughput. To overcome these challenges, we extend the PDHG framework to a distributed-memory setting via a practical two-dimensional grid partitioning of the constraint matrix, enabling scalable execution across multiple GPUs. Our implementation leverages the NCCL communication backend to efficiently synchronize primal-dual updates across devices. To improve load balance and computational efficiency, we introduce a block-wise random shuffling strategy combined with nonzero-aware data distribution, and further accelerate computation through fused CUDA kernels. By distributing both memory and computation, the proposed framework not only overcomes the single-GPU memory bottleneck but also achieves substantial speedups by exploiting multi-GPU parallelism with relatively low communication overhead. Extensive experiments on standard LP benchmarks, including MIPLIB and Hans Mittelmann's benchmark instances, as well as large-scale real-world datasets, show that our distributed implementation, built upon cuPDLPx, achieves strong scalability and high performance while preserving full FP64 numerical accuracy.
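
The abstract compresses several moving parts, so a small illustration may help. The sketch below is a minimal, single-process PDHG iteration for an LP in the standard form min cᵀx subject to Ax = b, x ≥ 0; the fixed step-size rule, the stopping test, and this standard-form choice are simplifying assumptions for exposition, not the paper's setup (cuPDLPx additionally handles bound constraints and uses adaptive step sizes and restarts).

```python
import numpy as np

def pdhg_lp(A, b, c, iters=20000, tol=1e-9):
    """Minimal PDHG sketch for: min c@x  s.t.  A@x = b, x >= 0.

    Convergence needs tau * sigma * ||A||_2^2 < 1; production solvers
    (PDLP, cuPDLPx) replace this with adaptive step sizes and restarts.
    """
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    tau = sigma = 0.9 / np.linalg.norm(A, 2)  # dense 2-norm: fine for a demo
    for _ in range(iters):
        # Primal step: move along -(c - A^T y), then project onto x >= 0.
        x_new = np.maximum(x - tau * (c - A.T @ y), 0.0)
        # Dual step: ascend on the residual at the extrapolated point.
        y = y + sigma * (b - A @ (2.0 * x_new - x))
        if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
            x = x_new
            break
        x = x_new
    return x, y

if __name__ == "__main__":
    A = np.array([[1.0, 1.0]])
    b = np.array([1.0])
    c = np.array([1.0, 2.0])
    x, y = pdhg_lp(A, b, c)
    print(x)  # approaches [1, 0]: all weight on the cheaper variable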

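The two-dimensional grid partitioning can be pictured as follows: rank (i, j) of an R × C device grid owns block A[i][j], the primal vector is sliced by grid column and the dual vector by grid row, so the two per-iteration products Ax and Aᵀy each need only one subgroup all-reduce. The sketch below uses PyTorch's torch.distributed with its NCCL backend as a stand-in for the paper's direct NCCL calls; the function and variable names are illustrative assumptions, and the paper's block-wise random shuffling and nonzero-aware distribution are not reproduced here.

```python
# Assumes dist.init_process_group("nccl") has already run and each rank has
# pinned its own CUDA device; device placement and error handling omitted.
import torch
import torch.distributed as dist

def make_grid_groups(R, C, rank):
    # Every rank must create every subgroup, even those it does not join.
    row_groups = [dist.new_group([i * C + j for j in range(C)]) for i in range(R)]
    col_groups = [dist.new_group([i * C + j for i in range(R)]) for j in range(C)]
    i, j = divmod(rank, C)
    return row_groups[i], col_groups[j]

def grid_matvec(A_block, x_block, row_group):
    # y_i = sum_j A[i][j] @ x_j: local partial product, then an all-reduce
    # across grid row i so every rank in the row holds the full slice y_i.
    # A_block would be a sparse block in practice; dense here for brevity.
    y_partial = A_block @ x_block
    dist.all_reduce(y_partial, op=dist.ReduceOp.SUM, group=row_group)
    return y_partial

def grid_matvec_transpose(A_block, y_block, col_group):
    # (A^T y)_j = sum_i A[i][j]^T @ y_i: same pattern down grid column j.
    g_partial = A_block.T @ y_block
    dist.all_reduce(g_partial, op=dist.ReduceOp.SUM, group=col_group)
    return g_partial
```

Under this layout each PDHG iteration communicates only within rows of size C and columns of size R rather than across all R·C devices, which is one plausible reading of the "relatively low communication overhead" the abstract claims.
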
Country of Origin
🇨🇳 China

Page Count
14 pages

Category
Mathematics:
Optimization and Control