Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
By: Xiao Zhang, Yaoyao Ding, Yang Hu, and more
Potential Business Impact:
Makes computer chips run AI tasks much faster.
Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations for these operators, such as fine-grained data pipelines and hardware-friendly memory layouts, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming effort. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Hexcute also uses task mapping to schedule the GPU program. To reduce programming effort, it automates layout and task-mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves a 1.7-11.28$\times$ speedup over existing DL compilers for mixed-type operators, and brings up to a 2.91$\times$ speedup in the end-to-end evaluation.
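To make "matrix multiplication with mixed input data types" concrete, here is a minimal reference sketch in plain Python. It is not Hexcute code and says nothing about how Hexcute lays out or schedules the computation. It assumes one common weight-only quantization scheme: floating-point activations, signed 4-bit weights packed two per byte (low nibble first), and one scale per output column. The function names and the packing order are hypothetical choices made for illustration.

```python
def pack_int4(lo, hi):
    """Pack two signed 4-bit integers (-8..7) into one byte, low nibble first."""
    return (lo & 0x0F) | ((hi & 0x0F) << 4)


def unpack_int4(byte):
    """Split one byte back into two signed 4-bit integers (low nibble first)."""
    def to_signed(v):
        # Map the unsigned range 0..15 to the signed range -8..7.
        return v - 16 if v >= 8 else v
    return to_signed(byte & 0x0F), to_signed((byte >> 4) & 0x0F)


def mixed_matmul(a, w_packed, scales):
    """Compute C = A @ dequant(W).

    a        : M x K list of floats (activations)
    w_packed : K x (N/2) list of bytes, each holding two int4 weights
    scales   : N floats, one dequantization scale per output column
    """
    m, k, n = len(a), len(a[0]), len(scales)
    c = [[0.0] * n for _ in range(m)]
    for kk in range(k):
        # Unpack one row of W from bytes into N signed 4-bit values.
        row = []
        for byte in w_packed[kk]:
            row.extend(unpack_int4(byte))
        for i in range(m):
            for j in range(n):
                # Dequantize on the fly: the int4 weight is scaled to a float
                # before the multiply-accumulate.
                c[i][j] += a[i][kk] * (row[j] * scales[j])
    return c


# W = [[1, -2], [3, 4]] stored as int4, two values per byte.
w_packed = [[pack_int4(1, -2)], [pack_int4(3, 4)]]
result = mixed_matmul([[1.0, 2.0]], w_packed, [0.5, 0.25])
print(result)
```

The unpack and dequantize steps sit inside the multiply-accumulate loop. On a GPU, how those steps are arranged across shared memory and registers is the kind of layout and pipelining decision the abstract refers to.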
Similar Papers
ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming
Computation and Language
Makes AI learn faster by using computer chips better.
Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation
Machine Learning (CS)
Makes AI smarter and faster using less power.
TileLang: A Composable Tiled Programming Model for AI Systems
Machine Learning (CS)
Makes AI programs run much faster and easier.