Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures
By: Marco Siracusa, Olivia Hsu, Victor Soria-Pardos, and others
Potential Business Impact:
Makes recommendation systems much faster.
Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6$\times$ higher performance and 6.4$\times$ higher performance/watt than GPUs on end-to-end models. We then propose the Ember compiler, which automatically generates optimized DAE code from PyTorch and TensorFlow. Unlike other DAE compilers, Ember features multiple intermediate representations, each designed for a different level of optimization. This lets Ember apply the full set of optimizations needed to match the performance of hand-written code, unlocking the potential of DAE architectures at scale.
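To make the bottleneck concrete, here is a minimal NumPy sketch of the irregular embedding lookup pattern the paper targets (in PyTorch this corresponds to ops like `torch.nn.EmbeddingBag`). The table size, dimensions, and indices below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical embedding table: 1,000 rows of 64-dim vectors.
table = np.random.rand(1000, 64).astype(np.float32)

# Irregular lookup: indices are data-dependent and scattered, so each
# row fetch is effectively a random memory access. This gather is the
# part a DAE access unit would offload, while the execute unit handles
# the arithmetic that follows.
indices = np.array([3, 917, 42, 42, 580])
gathered = table[indices]          # shape (5, 64)

# Typical pooling step, as in an embedding-bag style sum reduction.
pooled = gathered.sum(axis=0)      # shape (64,)
```

Because the indices are unpredictable, caches and prefetchers handle the gather poorly on conventional processors; a decoupled access unit can stream these fetches ahead of the compute.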
Similar Papers
Compiler Support for Speculation in Decoupled Access/Execute Architectures
Performance
Makes programs run faster by speculatively fetching data ahead of time.
A High-Level Compiler Integration Approach for Deep Learning Accelerators Supporting Abstraction and Optimization
Machine Learning (CS)
Lets deep learning software target new accelerator chips more easily.
A Tensor Compiler for Processing-In-Memory Architectures
Hardware Architecture
Makes AI models run much faster on new chips.