HaShiFlex: A High-Throughput Hardened Shifter DNN Accelerator with Fine-Tuning Flexibility
By: Jonathan Herbst, Michael Pellauer, Sherief Reda
Potential Business Impact:
Runs image-recognition AI up to 67x faster than GPUs, and with less power, for models that rarely change after deployment.
We introduce a high-throughput neural network accelerator that embeds most network layers directly in hardware, minimizing data transfer and memory usage while preserving a degree of flexibility via a small neural processing unit for the final classification layer. By leveraging power-of-two (Po2) quantization for weights, we replace multiplications with simple rewiring, effectively reducing each convolution to a series of additions. This streamlined approach offers high-throughput, energy-efficient processing, making it highly suitable for applications where model parameters remain stable, such as continuous sensing tasks at the edge or large-scale data center deployments. Furthermore, by including a strategically chosen reprogrammable final layer, our design achieves high throughput without sacrificing fine-tuning capabilities. We implement this accelerator in a 7nm ASIC flow using MobileNetV2 as a baseline and report throughput, area, accuracy, and sensitivity to quantization and pruning, demonstrating both the advantages and potential trade-offs of the proposed architecture. We find that for MobileNetV2, we can improve inference throughput by 20x over fully programmable GPUs, processing 1.21 million images per second through a full forward pass while retaining fine-tuning flexibility. If absolutely no post-deployment fine-tuning is required, this advantage increases to 67x, at 4 million images per second.
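To make the Po2 arithmetic concrete, here is a minimal Python sketch of the idea, assuming integer activations and an illustrative exponent range; quantize_po2 and po2_dot are hypothetical names, not the paper's implementation. Rounding each weight to a signed power of two turns every multiply into a fixed shift, so a dot product needs only an adder tree:

```python
import numpy as np

def quantize_po2(w, k_min=-7, k_max=0):
    # Round each nonzero weight to the nearest signed power of two:
    # w_q = sign * 2**k. Zeros stay zero, matching pruned weights.
    # The exponent bounds here are illustrative, not the paper's.
    sign = np.sign(w).astype(int)
    mag = np.where(np.abs(w) > 0, np.abs(w), 1.0)   # avoid log2(0)
    k = np.clip(np.round(np.log2(mag)).astype(int), k_min, k_max)
    return sign, np.where(sign == 0, 0, k)

def po2_dot(x, sign, k, k_min=-7):
    # Multiplier-free dot product over integer activations: each product
    # x * 2**k becomes a left shift by (k - k_min), so the accumulator
    # holds the exact result scaled by 2**-k_min. In hardware the shift
    # amounts are fixed wiring, leaving only an adder tree.
    acc = 0
    for xi, si, ki in zip(x, sign, k):
        if si == 0:
            continue                      # pruned weight: nothing to add
        acc += si * (int(xi) << (ki - k_min))
    return acc * 2.0 ** k_min             # undo the common fixed-point scale

# Example: four activations, one weight pruned to zero.
x = np.array([12, -5, 30, 7])
sign, k = quantize_po2(np.array([0.53, -0.24, 0.0, 0.9]))
print(po2_dot(x, sign, k))   # 14.25 == 12*0.5 + (-5)*(-0.25) + 7*1.0
```

In an ASIC, the per-weight shift amounts are constants, so no barrel shifter or multiplier is instantiated at all, which is what makes hardwiring whole layers practical.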
Similar Papers
FlexiQ: Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks
Machine Learning (CS)
Lets AI models trade a little accuracy for faster responses.
A Scalable FPGA Architecture With Adaptive Memory Utilization for GEMM-Based Operations
Hardware Architecture
Makes AI learn faster and use less power.
Design and Implementation of an FPGA-Based Hardware Accelerator for Transformer
Hardware Architecture
Makes AI models run much faster and cheaper.