Optimizing GEMM for Energy and Performance on Versal ACAP Architectures

Published: November 10, 2025 | arXiv ID: 2511.06907v1

By: Ilias Papalamprou, Dimosthenis Masouros, Ioannis Loudaros and more

Potential Business Impact:

Makes computer math run faster and use less power.

Business Areas:

GPU Hardware

General Matrix Multiplication (GEMM) is a fundamental operation in many scientific workloads, signal processing, and particularly deep learning. It is often a bottleneck for performance and energy efficiency, especially in edge environments with tight resource and power constraints. AMD's Versal ACAP offers heterogeneous components (AIEs, PL, PS) that can address these challenges, but mapping GEMM across them is complex, with prior works largely overlooking energy-performance trade-offs. In this paper, we propose an automated framework for Versal ACAP that generates GEMM mappings optimized for either performance or energy efficiency. Unlike prior analytical approaches, our method leverages a Machine Learning (ML) model, trained on approximately 6000 on-board experiments of different GEMM mappings, to guide Design Space Exploration, yielding more efficient designs. Evaluation on the Versal VCK190 shows geomean improvements of 1.23x (up to 2.5x) in throughput and 1.25x (up to 2.7x) in energy efficiency over state-of-the-art frameworks.
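
The abstract describes a predictor-guided Design Space Exploration: enumerate candidate GEMM mappings, score each one with a model, and keep the best for the chosen objective. The following is a minimal sketch of that loop, not the authors' implementation. Every name in it (`Mapping`, `Prediction`, `enumerate_mappings`, `explore`, `toy_predictor`) is hypothetical, the design space is reduced to three tile sizes, and `toy_predictor` is a hand-written placeholder standing in for the trained ML model, whose features and form the abstract does not specify.

```python
from itertools import product
from typing import Callable, Iterable, Iterator, NamedTuple


class Mapping(NamedTuple):
    """Hypothetical GEMM mapping: tile sizes along the M, K, and N dimensions."""
    tm: int
    tk: int
    tn: int


class Prediction(NamedTuple):
    """Predicted metrics for one mapping, in arbitrary units."""
    throughput: float
    power: float


def enumerate_mappings(m: int, k: int, n: int,
                       tile_sizes: tuple = (16, 32, 64, 128)) -> Iterator[Mapping]:
    """Yield every tile combination that evenly divides the problem size."""
    tms = [t for t in tile_sizes if m % t == 0]
    tks = [t for t in tile_sizes if k % t == 0]
    tns = [t for t in tile_sizes if n % t == 0]
    for tm, tk, tn in product(tms, tks, tns):
        yield Mapping(tm, tk, tn)


def explore(mappings: Iterable[Mapping],
            predict: Callable[[Mapping], Prediction],
            objective: str = "throughput") -> Mapping:
    """Return the mapping with the best predicted score for the objective."""
    def score(mapping: Mapping) -> float:
        p = predict(mapping)
        if objective == "throughput":
            return p.throughput
        if objective == "energy":
            # Energy efficiency taken as throughput per unit of power.
            return p.throughput / p.power
        raise ValueError(f"unknown objective: {objective}")

    return max(mappings, key=score)


def toy_predictor(mapping: Mapping) -> Prediction:
    """Placeholder cost model (an assumption, NOT the paper's trained model).

    Larger tiles are assumed to raise throughput with diminishing returns
    while power grows linearly with tile volume.
    """
    volume = mapping.tm * mapping.tk * mapping.tn
    return Prediction(throughput=volume ** 0.5, power=1.0 + volume / 65536)


candidates = list(enumerate_mappings(256, 256, 256))
best_perf = explore(candidates, toy_predictor, objective="throughput")
best_energy = explore(candidates, toy_predictor, objective="energy")
```

Under this placeholder model the two objectives select different mappings, which is the kind of energy-performance trade-off the abstract refers to. In a real flow, `predict` would wrap a model trained on measured on-board results.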

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

Distributed, Parallel, and Cluster Computing

Makes computers solve problems faster and use less power.

20 Aug 2025 0

90%

GAMA: High-Performance GEMM Acceleration on AMD Versal ML-Optimized AI Engines

Hardware Architecture

Makes AI learn much faster on special chips.

13 Apr 2025 2

89%

Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs

Hardware Architecture

Makes AI run much faster on new computer chips.

15 Dec 2025 2

Page Count

8 pages