Score: 0

Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs

Published: September 4, 2025 | arXiv ID: 2509.03846v1

By: Md Rownak Hossain Chowdhury, Mostafizur Rahman

Potential Business Impact:

Makes AI run faster using less power.

Business Areas:
Machine Learning Artificial Intelligence, Data and Analytics, Software

We introduce a mapping framework for deep learning inference that takes advantage of predictable neural network behavior to plan both computation and communication ahead of time. The framework generates a unified stream of instructions and data, enabling the hardware to execute operations and route information on its own, without frequent involvement from the host and with minimal off-chip memory use. This naturally reduces reliance on I/O, off-chip memory, and host control. By leveraging fine-grained message passing on a programmable, message-based compute architecture, the framework keeps data movement local and coordinates computation across the array using techniques such as stationary-weight reuse, in-array multicasting, and staged reductions. Applied to VGG-19, the framework sustains high utilization (88 to 92 percent), with over 97 percent of messages generated internally and nearly 89 percent of time consumed on-chip transfers. Computation throughput scales beyond 1 TFLOP/s on larger arrays, while traffic reductions from reuse and local aggregation reach up to 100 MB per layer. Overall, the results highlight the effectiveness of streaming-based computation and show how our mapper enables this execution style by tightly coordinating data and instruction flow across the hardware.

Country of Origin
πŸ‡ΊπŸ‡Έ United States

Page Count
11 pages

Category
Computer Science:
Hardware Architecture