MLoRQ: Bridging Low-Rank and Quantization for Transformer Compression

📅 2025-07-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address memory and computational constraints when deploying Transformer models on edge devices, this paper proposes a two-stage collaborative compression method that integrates low-rank approximation with mixed-precision quantization. It jointly models low-rank decomposition and quantization within a hierarchical optimization framework that automatically allocates layer-wise ranks and bit-widths under strict memory budgets, enabling cross-layer co-optimization. To mitigate the accuracy degradation induced by joint compression, a sequential adaptive rounding technique, compatible with mainstream quantization algorithms, is introduced. Extensive experiments on image classification, object detection, and instance segmentation demonstrate state-of-the-art performance, with up to 15% higher accuracy than existing methods.

📝 Abstract
Deploying transformer-based neural networks on resource-constrained edge devices presents a significant challenge. This challenge is often addressed through various techniques, such as low-rank approximation and mixed-precision quantization. In this work, we introduce Mixed Low-Rank and Quantization (MLoRQ), a novel method that integrates both techniques. MLoRQ employs a two-stage optimization process to determine optimal bit-width and rank assignments for each layer, adhering to predefined memory constraints. This process includes: (i) an intra-layer optimization that identifies potentially optimal compression solutions out of all low-rank and quantization combinations; (ii) an inter-layer optimization that assigns bit-width precision and rank to each layer while ensuring the memory constraint is met. An optional final step applies a sequential optimization process using a modified adaptive rounding technique to mitigate compression-induced errors in joint low-rank approximation and quantization. The method is compatible and can be seamlessly integrated with most existing quantization algorithms. MLoRQ shows state-of-the-art results with up to 15% performance improvement, evaluated on Vision Transformers for image classification, object detection, and instance segmentation tasks.
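The two-stage process in the abstract can be sketched as a per-layer Pareto filter followed by a knapsack-style dynamic program across layers. This is a hypothetical illustration, not the authors' MLoRQ implementation: the per-layer (memory, error) options and the integer memory units are assumptions for the sketch.

```python
# Hypothetical sketch of a two-stage rank/bit-width search under a memory
# budget (illustration only; not the authors' MLoRQ code).

def intra_layer_pareto(options):
    """Stage 1 (intra-layer): from all (memory, error) pairs induced by a
    layer's (rank, bit-width) combinations, keep only the non-dominated
    ones, i.e. those not beaten on both memory and error."""
    pareto = []
    for mem, err in sorted(options):
        if not pareto or err < pareto[-1][1]:
            pareto.append((mem, err))
    return pareto

def inter_layer_allocate(layers, budget):
    """Stage 2 (inter-layer): pick one option per layer so total memory
    stays within `budget` and total (assumed additive) error is minimal,
    via a knapsack-style dynamic program over integer memory units."""
    INF = float("inf")
    dp = [0.0] + [INF] * budget  # dp[m] = min error using exactly memory m
    for opts in layers:
        nxt = [INF] * (budget + 1)
        for m, best in enumerate(dp):
            if best == INF:
                continue
            for mem, err in opts:
                if m + mem <= budget and best + err < nxt[m + mem]:
                    nxt[m + mem] = best + err
        dp = nxt
    return min(dp)  # INF if the budget is infeasible
```

For example, with two layers whose Pareto sets are [(2, 5.0), (4, 1.0)] and [(1, 3.0), (2, 2.0)], a budget of 6 units admits the assignment (4, 1.0) + (2, 2.0) with total error 3.0, while a budget of 3 forces (2, 5.0) + (1, 3.0) with total error 8.0.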
Problem

Research questions and friction points this paper is trying to address.

Optimizing transformer compression for edge devices
Combining low-rank and quantization techniques effectively
Balancing memory constraints with performance improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage optimization for bit-width and rank
Intra-layer and inter-layer compression optimization
Compatible with existing quantization algorithms
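The optional sequential refinement described in the abstract can likewise be sketched: quantize one low-rank factor, then re-solve the other factor against the original weights before quantizing it, so the second factor absorbs part of the first factor's quantization error. The uniform quantizer and the least-squares compensation below are illustrative stand-ins, not the paper's modified adaptive rounding.

```python
import numpy as np

def quantize(x, bits):
    # Illustrative uniform symmetric quantizer (not the paper's rounding).
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    if scale == 0:
        return x.copy()
    return np.round(x / scale) * scale

def sequential_low_rank_quant(W, rank, bits):
    """Low-rank factorize W ~= A @ B via truncated SVD, quantize A, then
    re-fit B against the original W (least squares) before quantizing it,
    so B compensates for A's quantization error. A hypothetical stand-in
    for the sequential adaptive rounding step."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # absorb singular values into A
    Aq = quantize(A, bits)
    # Solve Aq @ B_adj ~= W so B_adj compensates for A's rounding error.
    B_adj = np.linalg.lstsq(Aq, W, rcond=None)[0]
    Bq = quantize(B_adj, bits)
    return Aq @ Bq
```

The key design point this sketch mirrors is the ordering: each factor is quantized with the previously quantized factors held fixed, rather than quantizing both independently.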