T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization
By: Hyunwoo Oh, KyungIn Nam, Rajat Bhattacharjya, and more
Potential Business Impact:
Makes smart AI run faster on small devices.
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, making efficient and scalable deployment difficult. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs), which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in the SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
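To make the core idea concrete, below is a minimal, hypothetical C sketch of the LUT-based ternary GEMV kernel family that T-SAR targets. It assumes a packed 2-bit ternary encoding (0 -> 0, 1 -> +1, 2 -> -1) and builds the lookup table in ordinary memory; T-SAR itself generates such tables dynamically inside the SIMD register file using lightly modified ALUs, so this sketch only illustrates the table-lookup scheme, not the paper's hardware mechanism.

/*
 * Conceptual sketch of LUT-based ternary GEMV (y += W x).
 * Hypothetical, illustration only: the weight encoding and layout below
 * are assumptions, and the table lives in memory rather than in the
 * SIMD register file as in T-SAR.
 */
#include <stdint.h>

#define GROUP 4  /* four 2-bit ternary weights share one 8-bit LUT index */

/* Decode one 2-bit field: 0 -> 0, 1 -> +1, 2 -> -1 (assumed encoding). */
static int decode2(unsigned code) { return (code == 1) ? 1 : (code == 2) ? -1 : 0; }

/* y[m] += sum_k W[m][k] * x[k], with W stored as packed 2-bit ternary codes,
 * one byte per group of GROUP weights, rows * (cols/GROUP) bytes in total. */
void ternary_gemv_lut(const uint8_t *w_packed, const int8_t *x,
                      int32_t *y, int rows, int cols) {
    for (int g = 0; g < cols / GROUP; ++g) {
        /* Precompute the partial dot product of this activation group with
         * every possible 4-weight pattern (256 entries). */
        int32_t lut[256];
        for (int idx = 0; idx < 256; ++idx) {
            int32_t s = 0;
            for (int k = 0; k < GROUP; ++k)
                s += decode2((idx >> (2 * k)) & 0x3) * x[g * GROUP + k];
            lut[idx] = s;
        }
        /* Each output row then needs only one table lookup per weight group,
         * replacing four multiply-accumulates with one indexed load. */
        for (int m = 0; m < rows; ++m)
            y[m] += lut[w_packed[m * (cols / GROUP) + g]];
    }
}

The lookup replaces multiply-accumulate work with indexing, which is why keeping the table in registers rather than memory, as T-SAR does, removes the main bandwidth bottleneck of CPU-side ternary inference.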
Similar Papers
TENET: An Efficient Sparsity-Aware LUT-Centric Architecture for Ternary LLM Inference On Edge
Hardware Architecture
Makes smart computer programs run much faster.
Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
Machine Learning (CS)
Makes computers think faster using clever tricks.
SAIL: SRAM-Accelerated LLM Inference System with Lookup-Table-based GEMV
Hardware Architecture
Makes AI smarter on regular computers.