🤖 AI Summary
This work proposes Greedy-Gnorm, a novel attention head pruning algorithm that addresses a limitation of existing methods: their reliance on static importance scores, which fail to capture the changing relevance of attention heads during iterative pruning. Greedy-Gnorm introduces a gradient-based dynamic evaluation mechanism: after each pruning step, it scores each attention head as the product of the L2 norms of its Q, K, and V gradient blocks, computed on a validation set. Coupled with a greedy selection strategy, this approach mitigates the staleness of head rankings inherent in static scoring schemes. Extensive experiments demonstrate that Greedy-Gnorm consistently outperforms entropy-based pruning across BERT, ALBERT, RoBERTa, and XLM-RoBERTa, maintaining superior task accuracy even at high pruning ratios.
📝 Abstract
Attention head pruning has emerged as an effective technique for transformer model compression, an increasingly important goal in the era of Green AI. However, existing pruning methods often rely on static importance scores, which fail to capture the evolving role of attention heads during iterative removal. We propose Greedy-Gradient norm (Greedy-Gnorm), a novel head pruning algorithm that recalculates head importance after each pruning step. Specifically, each head is scored by the product of the L2 norms of its Q/K/V gradient blocks, estimated on a hold-out validation set and updated at every greedy iteration. This dynamic scoring mitigates stale rankings and better reflects gradient-informed importance as pruning progresses. Extensive experiments on BERT, ALBERT, RoBERTa, and XLM-RoBERTa demonstrate that Greedy-Gnorm consistently preserves accuracy under substantial head removal, outperforming attention-entropy-based pruning. By reducing model size while maintaining task performance, Greedy-Gnorm offers a promising step toward more energy-efficient transformer deployment.
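The scoring-and-pruning loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `compute_grads` callback stands in for a backward pass over the validation set that returns per-head Q/K/V gradient blocks, and the toy example below fabricates those blocks just to show the greedy mechanics.

```python
import numpy as np

def head_scores(grads):
    # Score each head as the product of the L2 norms of its Q, K, and V
    # gradient blocks, per the Greedy-Gnorm criterion.
    return {h: float(np.linalg.norm(gq) * np.linalg.norm(gk) * np.linalg.norm(gv))
            for h, (gq, gk, gv) in grads.items()}

def greedy_gnorm_prune(compute_grads, heads, num_to_prune):
    # Greedy loop: after every removal, gradients (and hence scores) are
    # recomputed for the surviving heads, so rankings never go stale.
    alive = set(heads)
    pruned = []
    for _ in range(num_to_prune):
        grads = compute_grads(alive)   # hypothetical callback: validation-set gradients
        scores = head_scores(grads)
        victim = min(alive, key=scores.__getitem__)  # lowest score = least important
        alive.remove(victim)
        pruned.append(victim)
    return pruned, alive

# Toy stand-in for the validation backward pass: head h gets gradient
# blocks scaled by (h + 1), so lower-index heads score lower and go first.
def fake_grads(alive):
    return {h: tuple((h + 1) * np.ones((4, 4)) for _ in range(3)) for h in alive}

pruned, alive = greedy_gnorm_prune(fake_grads, range(6), 3)
print(pruned, sorted(alive))  # → [0, 1, 2] [3, 4, 5]
```

In a real setting, `compute_grads` would slice the attention layer's query/key/value weight gradients into per-head blocks after a backward pass on held-out data; everything else in the loop stays the same.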