BBAL: A Bidirectional Block Floating Point-Based Quantisation Accelerator for Large Language Models

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deploying large language models (LLMs) on edge devices is hindered by stringent memory and computational constraints. Conventional block floating-point (BFP) quantization—enforcing uniform exponent alignment across all values within a block using the block’s maximum magnitude—incurs significant precision loss, particularly for small- and medium-amplitude data. To address this, we propose bidirectional block floating-point (BBFP), a novel data format that employs a bidirectional exponent selection mechanism to reduce the probability of extreme values dominating the shared exponent. We further design a full-stack BBFP accelerator architecture, comprising a BBFP processing unit array and low-overhead nonlinear computation units, co-optimized via hardware-aware quantization mapping. Experimental results demonstrate that our approach achieves a 22% accuracy improvement over outlier-aware accelerators at comparable energy efficiency, and delivers a 40% energy-efficiency gain over BFP-based accelerators at equivalent accuracy.

📝 Abstract
Large language models (LLMs), with their billions of parameters, pose substantial challenges for deployment on edge devices, straining both memory capacity and computational resources. Block Floating Point (BFP) quantisation reduces memory and computational overhead by converting high-overhead floating-point operations into low-bit fixed-point operations. However, BFP requires aligning all data in a block to the maximum exponent, which causes precision loss for small and moderate values, resulting in quantisation error and degraded LLM accuracy. To address this issue, we propose a Bidirectional Block Floating Point (BBFP) data format, which reduces the probability of selecting the maximum as the shared exponent, thereby reducing quantisation error. By exploiting the features of BBFP, we present a full-stack Bidirectional Block Floating Point-Based Quantisation Accelerator for LLMs (BBAL), primarily comprising a processing element array based on BBFP, paired with a proposed cost-effective nonlinear computation unit. Experimental results show BBAL achieves a 22% improvement in accuracy compared to an outlier-aware accelerator at similar energy efficiency, and a 40% energy-efficiency improvement over a BFP-based accelerator at similar accuracy.
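The precision loss the abstract describes can be seen in a minimal sketch of conventional BFP quantisation (this is generic BFP, not the paper's BBFP or its accelerator mapping; the function name and 4-bit mantissa width are illustrative assumptions):

```python
import numpy as np

def bfp_quantise(block, mantissa_bits=4):
    """Conventional BFP (illustrative): every value in the block is
    aligned to the exponent of the largest magnitude, then rounded to
    a low-bit fixed-point mantissa on that shared scale."""
    max_mag = np.max(np.abs(block))
    if max_mag == 0.0:
        return np.zeros_like(block)
    # Shared exponent chosen so the block maximum fits the mantissa range
    shared_exp = np.ceil(np.log2(max_mag))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))  # weight of one mantissa LSB
    # Round onto the shared grid and clip to the signed mantissa range
    q = np.clip(np.round(block / scale),
                -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1)
    return q * scale

# One outlier forces a coarse shared scale: the small entries round to zero
block = np.array([6.0, 0.9, 0.07, -0.04])
print(bfp_quantise(block))
```

With the outlier 6.0 present, the shared scale becomes 1.0, so 0.07 and -0.04 are quantised to zero. This is the "maximum exponent dominates the block" failure mode that BBFP's bidirectional exponent selection is designed to mitigate.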
Problem

Research questions and friction points this paper is trying to address.

Reducing memory and computational overhead in LLMs for edge deployment
Mitigating quantisation error caused by Block Floating Point alignment
Improving accuracy and efficiency in LLM quantisation accelerators
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Block Floating Point (BBFP) data format
Processing element array based on BBFP
Cost-effective nonlinear computation unit
Xiaomeng Han
Southeast University
LLM Accelerators
Yuan Cheng
Houmo AI, Nanjing University
Jing Wang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Junyang Lu
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Hui Wang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
X.X. Zhang
Jilin Normal University
Ning Xu
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University
Dawei Yang
Houmo AI
Zhe Jiang
National Center of Technology Innovation for EDA, School of Integrated Circuits, Southeast University