Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs
By: Md Rownak Hossain Chowdhury, Mostafizur Rahman
Potential Business Impact:
Makes AI run faster while using less power.
We introduce a mapping framework for deep learning inference that exploits the predictable behavior of neural networks to plan both computation and communication ahead of time. The framework generates a unified stream of instructions and data, enabling the hardware to execute operations and route information on its own, which reduces reliance on I/O, off-chip memory, and host control. By leveraging fine-grained message passing on a programmable, message-based compute architecture, the framework keeps data movement local and coordinates computation across the array using techniques such as stationary-weight reuse, in-array multicasting, and staged reductions. Applied to VGG-19, the framework sustains high utilization (88 to 92 percent), with over 97 percent of messages generated internally and nearly 89 percent of transfer time spent on on-chip transfers. Computation throughput scales beyond 1 TFLOP/s on larger arrays, and traffic reductions from reuse and local aggregation reach up to 100 MB per layer. Overall, the results highlight the effectiveness of streaming-based computation and show how our mapper enables this execution style by tightly coordinating data and instruction flow across the hardware.
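To make the traffic-reduction claim concrete, the sketch below is a minimal, hypothetical off-chip traffic model for a single VGG-19 convolution layer. It compares a mapping that re-fetches weights for every spatial tile against a weight-stationary mapping in which weights stay resident on the array. The layer shape follows standard VGG-19 (conv4_1); the tile count, fp32 words, and the cost model itself are illustrative assumptions, not figures or code from the paper.

```python
# Hypothetical off-chip traffic model for one VGG-19 conv layer,
# contrasting a naive mapping with a weight-stationary mapping.
# All tiling choices here are assumptions for illustration only.

BYTES_PER_ELEM = 4  # fp32


def conv_weight_bytes(k, c_in, c_out):
    """Bytes of filter weights for a k x k convolution."""
    return k * k * c_in * c_out * BYTES_PER_ELEM


def off_chip_traffic(k, c_in, c_out, h, w, spatial_tiles, weight_stationary):
    """Rough off-chip traffic estimate for one layer.

    Input and output activations are streamed once in either mapping;
    weights are re-fetched once per spatial tile unless they remain
    resident on the array (weight-stationary reuse).
    """
    act_bytes = h * w * (c_in + c_out) * BYTES_PER_ELEM
    w_bytes = conv_weight_bytes(k, c_in, c_out)
    weight_traffic = w_bytes if weight_stationary else w_bytes * spatial_tiles
    return act_bytes + weight_traffic


if __name__ == "__main__":
    # VGG-19 conv4_1: 28x28x256 input -> 28x28x512 output, 3x3 kernels.
    # Assume the output plane is processed as 28 row tiles (hypothetical).
    naive = off_chip_traffic(3, 256, 512, 28, 28,
                             spatial_tiles=28, weight_stationary=False)
    reuse = off_chip_traffic(3, 256, 512, 28, 28,
                             spatial_tiles=28, weight_stationary=True)
    print(f"without reuse:     {naive / 1e6:7.1f} MB")
    print(f"weight-stationary: {reuse / 1e6:7.1f} MB")
    print(f"saved:             {(naive - reuse) / 1e6:7.1f} MB")
```

Under these assumed tilings the saving is on the order of 100 MB for this layer, which is the same order of magnitude as the per-layer traffic reductions reported in the abstract; the paper's actual mapping and numbers may differ.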
Similar Papers
Instruction-Based Coordination of Heterogeneous Processing Units for Acceleration of DNN Inference
Hardware Architecture
Speeds up AI by making computer chips work together.
A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations
Hardware Architecture
Makes AI learn faster and use less power.
AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
Hardware Architecture
Finds the best computer chips for AI tasks.