CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

📅 2025-10-14

🤖 AI Summary
The token embedding layers of large language models (LLMs) carry a high memory footprint, which hinders deployment on resource-constrained edge devices. This paper proposes CARVQ, a post-training embedding compression method that pairs a corrective adaptor, composed of linear and nonlinear maps, with group residual vector quantization (group RVQ). It compresses embeddings to approximately 1.6 bits per parameter without requiring specialized hardware for lower-bit storage, and it is compatible with existing Transformer quantization methods and with any hardware that supports 4-bit memory. Evaluated on LLaMA-3.2, LLaMA-3.1, Qwen2.5, and Phi-4 models, CARVQ in most cases achieves a lower average bitwidth per parameter than scalar quantization while maintaining reasonable perplexity and task accuracy. Because inference on edge devices is memory-bound, the authors argue that the smaller embedding footprint also frees memory bandwidth and speeds up inference.

📝 Abstract
Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a novel post-training Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding, compressing it to approximately 1.6 bits per parameter without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve a lower average bitwidth per parameter than scalar quantization while maintaining reasonable perplexity and accuracy. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model's memory footprint on memory-constrained devices. This work represents a crucial step toward the efficient deployment of LLMs on edge devices.
Problem

Research questions and friction points this paper is trying to address.

Compress LLM embedding layers to reduce memory requirements
Achieve high compression rates without specialized hardware support
Maintain model performance while enabling edge device deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adds a corrective adaptor, composed of linear and nonlinear maps, that mimics the original embedding
Uses group residual vector quantization to compress the embedding layer
Reaches approximately 1.6 bits per parameter without specialized hardware for lower-bit storage
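
The group residual vector quantization named above can be sketched in a few lines. This is a minimal pure-Python illustration and not the authors' implementation. It assumes that "group" means cutting each embedding row into contiguous sub-vectors that share codebooks, it fits each stage with plain k-means, and it leaves out the corrective adaptor entirely. All function names, the toy table, and the sizes (k, stages, group_dim) are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        counts = [0] * k
        for p in points:
            j = nearest(p, centroids)
            counts[j] += 1
            for d, x in enumerate(p):
                sums[j][d] += x
        for j in range(k):
            if counts[j]:  # an empty cluster keeps its previous centroid
                centroids[j] = [s / counts[j] for s in sums[j]]
    return centroids


def split_groups(row, group_dim):
    """Assumed grouping: cut one embedding row into contiguous sub-vectors."""
    return [row[i:i + group_dim] for i in range(0, len(row), group_dim)]


def fit_rvq(vectors, k, stages, seed=0):
    """Residual VQ: each stage quantizes what earlier stages left over."""
    residuals = [list(v) for v in vectors]
    codebooks, codes = [], [[] for _ in vectors]
    for s in range(stages):
        codebook = kmeans(residuals, k, seed=seed + s)
        codebooks.append(codebook)
        for i, r in enumerate(residuals):
            j = nearest(r, codebook)
            codes[i].append(j)
            residuals[i] = [x - c for x, c in zip(r, codebook[j])]
    return codebooks, codes


def decode_rvq(codebooks, code):
    """Reconstruct a sub-vector by summing one entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, j in zip(codebooks, code):
        out = [o + c for o, c in zip(out, codebook[j])]
    return out


def index_bits_per_param(k, stages, group_dim):
    """Index storage only; codebooks and any adaptor weights are excluded."""
    return stages * math.log2(k) / group_dim


def mse(vectors, codebooks, codes):
    """Mean squared reconstruction error per element."""
    total = sum(sq_dist(v, decode_rvq(codebooks, c)) for v, c in zip(vectors, codes))
    return total / (len(vectors) * len(vectors[0]))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for an embedding table: 64 "tokens", 8 dimensions each.
    table = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
    groups = [g for row in table for g in split_groups(row, 4)]
    for stages in (1, 2):
        codebooks, codes = fit_rvq(groups, k=16, stages=stages)
        print(stages, index_bits_per_param(16, stages, 4), mse(groups, codebooks, codes))
```

With 16-entry codebooks each index fits in 4 bits, which is how a scheme of this kind can sit on 4-bit storage while averaging fewer bits per parameter: `index_bits_per_param(16, 2, 5)` evaluates to 1.6. That setting is chosen only to show the arithmetic; it is not taken from the paper, and it ignores the storage cost of the codebooks themselves.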