Score: 1

Making Strong Error-Correcting Codes Work Effectively for HBM in AI Inference

Published: December 20, 2025 | arXiv ID: 2512.18152v1

By: Rui Xie , Yunhua Fang , Asad Ul Haq and more

BigTech Affiliations: IBM

Potential Business Impact:

Makes computer memory cheaper and more reliable.

Business Areas:
Hardware Hardware

LLM inference is increasingly memory bound, and HBM cost per GB dominates system cost. Current HBM stacks include short on-die ECC that tightens binning, raises price, and fixes reliability policy inside the device. This paper asks whether a system can tolerate a much higher raw HBM bit error rate and still keep end-to-end correctness and throughput, without changing the HBM PHY or the fixed 32 B transaction size. We propose REACH, a controller managed ECC design that keeps the HBM link and 32 B transfers unchanged. REACH uses a two level Reed-Solomon scheme: each 32 B chunk uses an inner code to check and correct most faults locally, while chunks that cannot be fixed are marked as erasures. An outer code spans kilobytes and runs in erasure only mode, repairing only flagged chunks and avoiding the expensive locator step. For small random writes, REACH updates outer parity with differential parity to avoid recomputing parity over the whole span, and an optional importance adaptive bit plane policy can protect only critical fields such as BF16 exponents to reduce ECC work and traffic. On three LLMs at 8K context, REACH keeps about 79 percent of on-die ECC throughput at zero BER and remains qualified up to a raw BER of 1e-3, extending tolerable device error rates by about three orders of magnitude while keeping tokens per second nearly flat. In ASAP7, a full REACH controller occupies 15.2 mm2 and consumes 17.5 W at 3.56 TB/s, and it reduces ECC area by 11.6x and power by about 60 percent compared to a naive long Reed-Solomon baseline. By moving strong ECC into the controller, REACH turns long code reliability into a system choice that can enable lower cost HBM under the same standard interface.

Country of Origin
🇺🇸 United States

Page Count
13 pages

Category
Computer Science:
Hardware Architecture