F-BFQ: Flexible Block Floating-Point Quantization Accelerator for LLMs

Published: October 15, 2025 | arXiv ID: 2510.13401v1

By: Jude Haris, José Cano

Potential Business Impact:

Makes large language models run faster on small, low-power edge devices.

Business Areas:
Field-Programmable Gate Array (FPGA) Hardware

Large Language Models (LLMs) have become increasingly prominent for daily tasks, from improving sound-to-text translation to generating additional frames for the latest video games. With the help of LLM inference frameworks, such as llama.cpp, which support optimizations such as KV-caching and quantization, it is now easier than ever to deploy LLMs on edge devices. Quantization is fundamental to enable LLMs on resource-constrained edge devices, and llama.cpp utilizes block floating-point (BFP) quantization to drastically reduce the bit width of weights and input tensors, the memory footprint, and the computational power required to run LLMs. LLMs are typically quantized with mixed BFP quantization across the model layers to reduce the loss of model accuracy due to quantization. Therefore, to efficiently accelerate the layers of BFP-quantized LLMs, specialized accelerators need to support different BFP variants without reconfiguration. To address this issue, we propose a Flexible Block Floating-Point Quantization (F-BFQ) accelerator, which can dynamically switch between two BFP quantization variants and perform matrix multiplication (MatMul) operations. Our initial F-BFQ accelerator design, deployed on the AMD Kria board, reduces inference time by 1.4x on average over the Arm NEON-based CPU execution across three BFP quantized LLMs while achieving 5.2 tokens per second (~3.9 words per second).
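To make the BFP idea in the abstract concrete, below is a minimal Python sketch of block quantization with a shared per-block scale, roughly in the spirit of llama.cpp's low-bit block formats. The block size of 32, the signed 4-bit range, and the fp16 scale are illustrative assumptions, not the paper's exact F-BFQ variants, and real formats additionally pack the mantissas into bytes.

```python
import numpy as np

def bfp_quantize(x, block_size=32, bits=4):
    """Quantize a 1-D tensor using one shared scale per block of values.

    Illustrative only: actual BFP variants (e.g. llama.cpp's Q4 formats)
    differ in scale encoding, rounding, and bit packing.
    """
    assert x.size % block_size == 0
    blocks = x.reshape(-1, block_size)
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.astype(np.float16)

def bfp_dequantize(q, scales):
    """Recover an approximation of the original tensor."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

# Example: an approximate dot product, the core operation of a MatMul tile
rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
w = rng.standard_normal(64).astype(np.float32)
qw, sw = bfp_quantize(w)
print(f"exact={np.dot(x, w):.4f}  bfp-approx={np.dot(x, bfp_dequantize(qw, sw)):.4f}")
```

Because each block stores low-bit mantissas plus a single scale, memory traffic and multiply width drop sharply; an accelerator like F-BFQ must additionally handle layers quantized with different block formats without being reconfigured between them.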

Page Count
5 pages

Category
Computer Science:
Hardware Architecture