A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference
By: Michael V. DeBole, Rathinakumar Appuswamy, Neil McGlohon, and more
Potential Business Impact:
Runs large AI language models faster and with less energy for businesses.
A vertically integrated, end-to-end research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m^2 42U rack footprint. It can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model at a context length of 2,048 with 28 simultaneous users and a per-user inter-token latency of 2.8 ms. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is well suited to deploying agentic workflows for enterprise AI applications in existing data center (cloud, on-prem) environments. For example, it can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model.
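The headline figures imply a few useful derived quantities. The sketch below is back-of-envelope arithmetic using only the numbers quoted above; the per-token energy estimate assumes all three model instances run fully loaded at the stated 30 kW, which is an assumption, not a measured result from the paper.

```python
# Back-of-envelope arithmetic from the figures quoted in the summary above.
# Input values come from the text; derived quantities are illustrative estimates.

cards_total = 288               # NorthPole inference accelerator cards
servers = 18                    # 2U servers in the 42U rack
peak_ops_peta = 115             # peta-ops at 4-bit integer precision
power_kw = 30                   # quoted rack power draw
instances = 3                   # simultaneous Granite-3.3-8b-instruct instances
users_per_instance = 28         # simultaneous users per instance
inter_token_latency_s = 2.8e-3  # per-user inter-token latency

cards_per_server = cards_total / servers                # 16 cards per 2U server
tokens_per_s_per_user = 1.0 / inter_token_latency_s     # ~357 tokens/s per user
tokens_per_s_per_instance = users_per_instance * tokens_per_s_per_user  # ~10,000
tokens_per_s_system = instances * tokens_per_s_per_instance             # ~30,000

# Assumption: full 30 kW draw with all instances saturated; measured
# efficiency in the paper may differ.
joules_per_token = (power_kw * 1e3) / tokens_per_s_system               # ~1 J/token
compute_density = peak_ops_peta / power_kw              # ~3.8 peta-ops/kW at INT4

print(f"cards per server:        {cards_per_server:.0f}")
print(f"tokens/s per user:       {tokens_per_s_per_user:.0f}")
print(f"tokens/s per instance:   {tokens_per_s_per_instance:.0f}")
print(f"tokens/s system (est.):  {tokens_per_s_system:.0f}")
print(f"energy per token (est.): {joules_per_token:.2f} J")
print(f"peta-ops per kW (INT4):  {compute_density:.1f}")
```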
Similar Papers
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster and cheaper.
OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency
Distributed, Parallel, and Cluster Computing
Makes AI answer questions much faster.
Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM
Distributed, Parallel, and Cluster Computing
Makes supercomputers run AI faster for many people.