Real-Time FPGA-Based Transformers & VLMs for Vision Tasks: SOTA Designs and Optimizations
By: Safa Mohammed Sali, Mahmoud Meribout, Ashiyana Abdul Majeed
Potential Business Impact:
Makes smart AI run faster on small devices.
Transformers and vision-language models (VLMs) have emerged as dominant architectures in computer vision and multimodal AI, offering state-of-the-art performance in tasks such as image classification, object detection, visual question answering, and caption generation. However, their high computational complexity, large memory footprints, and irregular data access patterns present significant challenges for deployment in latency- and power-constrained environments. Field-programmable gate arrays (FPGAs) provide an attractive hardware platform for such workloads due to their reconfigurability, fine-grained parallelism, and potential for energy-efficient acceleration. This paper presents a comprehensive review of design trade-offs, optimization strategies, and implementation challenges for FPGA-based inference of transformers and VLMs. We examine critical factors such as device-class selection, memory subsystem constraints, dataflow orchestration, quantization strategies, sparsity exploitation, and toolchain choices, alongside modality-specific issues unique to VLMs, including heterogeneous compute balancing and cross-attention memory management. Additionally, we discuss emerging trends in hardware-algorithm co-design, highlighting innovations in attention mechanisms, compression, and modular overlays to improve efficiency and adaptability. Practical issues such as runtime flexibility, verification overhead, and the absence of standardized FPGA multimodal benchmarks are also considered. Finally, we outline future directions toward scalable, portable, and reconfigurable FPGA solutions that adapt to evolving model architectures while sustaining high utilization and predictable performance. This synthesis offers both a technical foundation and a forward-looking perspective to help bridge the gap between advanced multimodal AI models and efficient FPGA deployment.
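To make the quantization discussion above concrete, the sketch below is an illustrative assumption rather than code from the paper: it applies symmetric per-tensor INT8 post-training quantization to a weight matrix with NumPy, the kind of fixed-point conversion typically performed before mapping transformer weights onto FPGA arithmetic. The function name `quantize_int8_symmetric` and the toy projection matrix are hypothetical.

```python
import numpy as np

def quantize_int8_symmetric(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization (illustrative sketch).

    Returns integer weights and the scale needed to dequantize them
    (w ~= q * scale), mirroring the fixed-point representations that
    FPGA inference engines commonly consume.
    """
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

# Example: quantize a small (hypothetical) attention projection matrix.
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8_symmetric(w)
err = np.max(np.abs(w - q.astype(np.float32) * scale))
print(f"scale={scale:.5f}, max reconstruction error={err:.5f}")
```

In practice, per-channel scaling, activation quantization, and calibration data are also required; these are among the trade-offs that the quantization strategies surveyed here must balance against FPGA resource budgets.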
Similar Papers
Real Time FPGA Based CNNs for Detection, Classification, and Tracking in Autonomous Systems: State of the Art Designs and Optimizations
Hardware Architecture
Makes cameras understand things faster and with less power.
TrackCore-F: Deploying Transformer-Based Subatomic Particle Tracking on FPGAs
High Energy Physics - Experiment
Makes AI models run faster on special chips.
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
Hardware Architecture
Makes AI run faster and use less power.