Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer
By: Richie Li, Sicheng Chen
Potential Business Impact:
Speeds up the core matrix math of Transformer language models by roughly 7x on a low-cost edge FPGA, enabling faster and more energy-efficient AI inference on small devices.
Transformer-based large language models (LLMs) rely heavily on intensive matrix multiplications in their attention and feed-forward layers, and the Q, K, and V linear projections in the Multi-Head Self-Attention (MHA) module are a dominant performance bottleneck. In this work, we present an optimized tiled matrix-multiplication accelerator on a resource-constrained Xilinx KV260 FPGA that targets this bottleneck. Our design combines persistent on-chip storage, a two-level tiling strategy for maximal data reuse, and a systolic-like unrolled compute engine to deliver high throughput at low power. Integrated into DistilBERT for the Q, K, and V projections, the accelerator achieves a 7x speedup over an ARM CPU implementation (PyTorch) and a 200x improvement over naive NumPy, reaching up to 3.1 GFLOP/s on (64,768) x (768,3072) matrix multiplications while operating at a conservative 100 MHz. These results demonstrate the potential of FPGA-based acceleration for critical Transformer operations, pointing toward scalable and energy-efficient deep learning inference on edge devices.
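To make the dataflow concrete, the sketch below shows one plausible Vitis HLS formulation of the two-level tiled matrix multiplication described above: outer tiles are staged in on-chip buffers for reuse, and the inner reduction is unrolled into a parallel MAC array. The kernel name `tiled_gemm`, the tile sizes `T_M`/`T_N`/`T_K`, and the pragmas are illustrative assumptions, not the authors' actual implementation.

```cpp
// Hypothetical HLS-style sketch of a two-level tiled GEMM for the Q/K/V
// projection shape (64,768) x (768,3072). Tile sizes and buffer layout are
// assumptions for illustration; they are not the reported design parameters.

constexpr int M = 64, K = 768, N = 3072;     // C[M][N] = A[M][K] * B[K][N]
constexpr int T_M = 16, T_N = 16, T_K = 32;  // on-chip tile dimensions (assumed)

extern "C" void tiled_gemm(const float *A, const float *B, float *C) {
#pragma HLS INTERFACE m_axi port=A bundle=gmem0
#pragma HLS INTERFACE m_axi port=B bundle=gmem1
#pragma HLS INTERFACE m_axi port=C bundle=gmem2

    // Level-1 tiles held persistently in on-chip BRAM for reuse.
    static float a_tile[T_M][T_K];
    static float b_tile[T_K][T_N];
    float c_tile[T_M][T_N];
#pragma HLS ARRAY_PARTITION variable=a_tile complete dim=2
#pragma HLS ARRAY_PARTITION variable=b_tile complete dim=1

    for (int m0 = 0; m0 < M; m0 += T_M) {
        for (int n0 = 0; n0 < N; n0 += T_N) {
            // Clear the output accumulator tile.
            for (int i = 0; i < T_M; ++i)
                for (int j = 0; j < T_N; ++j)
                    c_tile[i][j] = 0.0f;

            for (int k0 = 0; k0 < K; k0 += T_K) {
                // Level 1: burst-load input tiles from DDR into on-chip buffers.
                for (int i = 0; i < T_M; ++i)
                    for (int k = 0; k < T_K; ++k)
                        a_tile[i][k] = A[(m0 + i) * K + (k0 + k)];
                for (int k = 0; k < T_K; ++k)
                    for (int j = 0; j < T_N; ++j)
                        b_tile[k][j] = B[(k0 + k) * N + (n0 + j)];

                // Level 2: systolic-like compute; the reduction over k is fully
                // unrolled so T_K multiply-accumulates run in parallel per output.
                for (int i = 0; i < T_M; ++i) {
                    for (int j = 0; j < T_N; ++j) {
#pragma HLS PIPELINE II=1
                        float acc = c_tile[i][j];
                        for (int k = 0; k < T_K; ++k) {
#pragma HLS UNROLL
                            acc += a_tile[i][k] * b_tile[k][j];
                        }
                        c_tile[i][j] = acc;
                    }
                }
            }

            // Write the finished output tile back to DDR.
            for (int i = 0; i < T_M; ++i)
                for (int j = 0; j < T_N; ++j)
                    C[(m0 + i) * N + (n0 + j)] = c_tile[i][j];
        }
    }
}
```

In a design of this kind, the outer (level-1) tiles bound the DDR traffic per output tile, while the unrolled inner (level-2) loop determines how many MACs the compute engine issues per cycle; the actual accelerator's tile sizes and partitioning would be chosen to fit the KV260's BRAM and DSP budget.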