Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers

📅 2025-08-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high computational and memory overhead of multi-head self-attention (MHSA) in Vision Transformers (ViTs) deployed on edge devices, this paper proposes Entropy-Aware Attention Map compression (EAM). EAM quantifies the information content of each attention head via Shannon entropy, identifies low-entropy heads, and freezes them—applying extreme low-bit quantization and structured sparsification to eliminate redundant computation. Its core innovation lies in integrating information-theoretic analysis directly into the MHSA compression pipeline, enabling entropy-guided, head-level collaborative compression. Evaluated on ImageNet-1k, EAM maintains or even improves top-1 accuracy for DeiT and Swin models under ≤20% sparsity, while remaining highly competitive at aggressive compression ratios. This work establishes a novel paradigm for efficient ViT deployment on resource-constrained edge platforms.

📝 Abstract
Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, the computational complexity and high memory demands of MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM achieves similar or higher accuracy at ≤20% sparsity in attention maps and competitive performance beyond this level for the DeiT and Swin Transformer models.
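The per-head entropy measure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `head_entropy` is a hypothetical helper that averages the Shannon entropy of each head's softmax rows, so a near-deterministic head scores close to 0 bits and a near-uniform head scores close to log2(keys) bits.

```python
import numpy as np

def head_entropy(attn):
    """Mean Shannon entropy (bits) per attention head.

    attn: array of shape (heads, queries, keys); each row along the
    last axis is a probability distribution (softmax output).
    """
    eps = 1e-12  # guard against log(0) for one-hot rows
    h = -np.sum(attn * np.log2(attn + eps), axis=-1)  # (heads, queries)
    return h.mean(axis=-1)                            # (heads,)

# Toy example: one near-deterministic head vs. one uniform head
det = np.eye(4)[None]               # (1, 4, 4): one-hot rows, ~0 bits
uni = np.full((1, 4, 4), 0.25)      # (1, 4, 4): uniform rows, 2 bits
ent = head_entropy(np.concatenate([det, uni], axis=0))
# the low-entropy head is the candidate for freezing/compression
```

Under the paper's insight, heads whose entropy falls below a chosen threshold would be the ones frozen and quantized.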
Problem

Research questions and friction points this paper is trying to address.

Reducing computational complexity of Vision Transformers
Addressing high memory demands from attention mechanisms
Exploiting redundancy in attention maps for compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploits information redundancy in attention maps
Freezes low-entropy attention map weights
Quantizes values to low precision
Lucas Maisonnave
PhD student at CEA
Deep Learning · LLM · Quantization · Compression
Karim Haroun
i3S / CNRS, Université Côte d'Azur, Sophia Antipolis, France
Tom Pégeot
Université Paris-Saclay CEA, List, F-91120 Palaiseau, France