Morello: Compiling Fast Neural Networks with Dynamic Programming and Spatial Compression

Published: May 3, 2025 | arXiv ID: 2505.01637v1

By: Samuel J. Kaufman, René Just, Rastislav Bodik

BigTech Affiliations: University of Washington

Potential Business Impact:

Speeds up neural network inference by automatically generating highly optimized CPU code, reducing compute cost for deployed models.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

High-throughput neural network inference requires coordinating many optimization decisions, including parallel tiling, microkernel selection, and data layout. The product of these decisions forms a search space of programs which is typically intractably large. Existing approaches (e.g., auto-schedulers) often address this problem by sampling this space heuristically. In contrast, we introduce a dynamic-programming-based approach to explore more of the search space by iteratively decomposing large program specifications into smaller specifications reachable from a set of rewrites, then composing a final program from each rewrite that minimizes an affine cost model. To reduce memory requirements, we employ a novel memoization table representation, which indexes specifications by coordinates in $\mathbb{Z}_{\geq 0}$ and compresses identical, adjacent solutions. This approach can visit a much larger set of programs than prior work. To evaluate the approach, we developed Morello, a compiler which lowers specifications roughly equivalent to a few-node XLA computation graph to x86. Notably, we found that an affine cost model is sufficient to surface high-throughput programs. For example, Morello synthesized a collection of matrix multiplication benchmarks targeting a Zen 1 CPU, including a 1x2048x16384, bfloat16-to-float32 vector-matrix multiply, which was integrated into Google's gemma.cpp.
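The compressed memoization table is the key memory-saving idea: specifications are indexed by non-negative integer coordinates, and runs of adjacent specifications that share the same optimal solution are stored once. The sketch below illustrates that run-compression idea in Python for a single coordinate axis; all names (`CompressedMemoTable`, `put`, `get`, the toy "solutions") are hypothetical illustrations, not the Morello implementation, and the real table is multi-dimensional.

```python
# Illustrative sketch (assumptions, not Morello's code): a memo table over
# 1-D coordinates in Z>=0 that stores each run of identical adjacent
# solutions once, as a (run_start, solution) pair.
import bisect

class CompressedMemoTable:
    """Maps non-negative integer coordinates to solutions, compressing
    runs of identical adjacent solutions into a single stored entry."""

    def __init__(self):
        self._starts = []   # sorted start coordinates of each run
        self._values = []   # the solution shared by each run

    def put(self, coord, solution):
        # Append-only sketch: assumes coordinates are inserted in
        # increasing order, as a DP sweep over spec sizes would produce.
        if self._values and self._values[-1] == solution:
            return  # same as the previous run: extend it implicitly
        self._starts.append(coord)
        self._values.append(solution)

    def get(self, coord):
        # Locate the run containing `coord`: the greatest start <= coord.
        i = bisect.bisect_right(self._starts, coord) - 1
        if i < 0:
            raise KeyError(coord)
        return self._values[i]

table = CompressedMemoTable()
for size in range(100):
    # Toy DP result: specs below size 50 happen to share one optimal
    # decomposition, the rest share another, so only 2 runs are stored.
    table.put(size, "kernel_a" if size < 50 else "kernel_b")

print(len(table._starts))          # 2 runs instead of 100 entries
print(table.get(10), table.get(75))
```

Because many neighboring problem sizes map to the same optimal decomposition in practice, this representation can hold solutions for a far larger region of the search space than a dense table of the same memory footprint.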

Country of Origin
🇺🇸 United States

Page Count
13 pages

Category
Computer Science:
Programming Languages