Improving compiler support for SIMD offload using Arm Streaming SVE
By: Mohamed Husain Noor Mohamed, Adarsh Patil, Latchesar Ionkov, et al.
Potential Business Impact:
Helps compilers automatically use special math chips, making software faster.
The wider adoption of tightly coupled core-adjacent accelerators, such as Arm Scalable Matrix Extension (SME), hinges on lowering software programming complexity. In this paper, we focus on enabling the use of SME architecture in Streaming Scalable Vector Extension (SSVE) mode for workloads written in C/C++. While current compilers optimize loops for all types of SIMD instructions, these techniques primarily target vector units within the core and falter when applied to disaggregated, core-adjacent SIMD accelerators. Our goal is to enable the compiler to automatically generate code for such accelerators only when profitable. To this end, we investigate a path towards performant, precise, and repeatable computation offloading through two compiler ecosystems. We revisit LLVM compiler passes, MLIR transforms and their associated cost models, and heuristics. We hope that these insights can provide directions for evolving compiler capabilities towards automatic code generation for this next-generation vector processing paradigm.
Similar Papers
Performance Optimization of 3D Stencil Computation on ARM Scalable Vector Extension
Performance
Speeds up computer weather forecasts and saves energy.
Retrofitting Control Flow Graphs in LLVM IR for Auto Vectorization
Programming Languages
Improves automatic vectorization so programs run faster.
ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
Distributed, Parallel, and Cluster Computing
Speeds up scientific computing using special vector instructions.