The silence of the weights: an investigation of structural pruning strategies for attention-based audio signal architectures

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Audio Transformer models suffer from a large parameter count and the high computational cost of their self-attention mechanisms. Method: This paper proposes a fine-grained structured pruning approach for Audio Spectrogram Transformers (ASTs). It first decouples the four projection matrices (query, key, value, and output) within each attention block, enabling their independent pruning. It then introduces a joint pruning strategy across the head and channel dimensions to achieve structured sparsity tailored to the AST architecture. Contribution/Results: When 50% of the attention parameters are pruned, the method incurs less than a 1% accuracy drop on standard audio classification benchmarks, achieving near-lossless compression at a high pruning ratio. Experimental results show substantial gains in inference efficiency while preserving accuracy. The proposed framework establishes a scalable and interpretable paradigm for structured compression of audio Transformers, advancing lightweight deployment of spectrogram-based models.
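The decoupled pruning idea above can be illustrated with a minimal sketch: treat each of the four projection matrices separately and zero out whole attention heads by a magnitude score. The function name, the per-head L2-norm criterion, the toy dimensions, and the per-matrix ratios are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def prune_heads(weight, num_heads, keep_ratio):
    """Zero the lowest-saliency heads of a single projection matrix.

    Assumed layout: heads occupy contiguous column blocks of size
    embed_dim // num_heads. Saliency is the per-head L2 norm, a common
    magnitude heuristic (the paper's exact criterion may differ).
    """
    w = weight.copy()
    head_dim = w.shape[1] // num_heads
    heads = w.reshape(w.shape[0], num_heads, head_dim)
    scores = np.linalg.norm(heads, axis=(0, 2))      # one score per head
    keep = max(1, int(round(num_heads * keep_ratio)))
    drop = np.argsort(scores)[: num_heads - keep]    # weakest heads
    heads[:, drop, :] = 0.0
    return heads.reshape(w.shape), drop

# Decoupled pruning: each projection gets its own mask and its own ratio.
rng = np.random.default_rng(0)
D, H = 8, 4
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) for _ in range(4))
Wq_p, dropped_q = prune_heads(Wq, H, keep_ratio=0.5)   # prune Q harder
Wv_p, dropped_v = prune_heads(Wv, H, keep_ratio=0.75)  # keep more of V
```

Because the four matrices are pruned independently, the query projection can be compressed more aggressively than the value projection, which is exactly the flexibility that coupling them would forbid.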

📝 Abstract
Transformer-based models have become the state of the art across multiple domains, from natural language processing to machine listening, thanks to attention mechanisms. However, attention layers require a large number of parameters and high-end hardware for both training and inference. We propose a novel pruning technique targeted explicitly at the attention mechanism, where we decouple the pruning of the four layers in the attention block, namely the query, key, value, and output projection matrices. We also investigate strategies for pruning along the head and channel dimensions, and compare the performance of the Audio Spectrogram Transformer (AST) model under different pruning scenarios. Our results show that even when pruning 50% of the attention parameters, we incur a performance degradation of less than 1%.
Problem

Research questions and friction points this paper is trying to address.

Reducing parameters in attention layers for efficient Transformer models
Developing pruning strategies for query, key, value and output projections
Maintaining performance while pruning attention parameters by 50%
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples pruning of attention block layers
Prunes along head and channel dimensions
Reduces parameters by 50% with minimal loss
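The channel-dimension half of the innovation above can be sketched the same way: score whole input channels (columns) of a projection matrix and zero the weakest ones. The helper name, the L2 criterion, and the toy matrix are hypothetical; the paper's joint head-and-channel strategy is more elaborate than this single-matrix sketch.

```python
import numpy as np

def prune_channels(weight, keep_ratio):
    """Zero whole input channels (columns) with the smallest L2 norm.

    A minimal magnitude-based sketch of structured channel pruning;
    the saliency criterion is an assumption, not the paper's method.
    """
    w = weight.copy()
    scores = np.linalg.norm(w, axis=0)           # one score per channel
    n = len(scores)
    keep = max(1, int(round(n * keep_ratio)))
    w[:, np.argsort(scores)[: n - keep]] = 0.0   # drop the weakest columns
    return w

W = np.array([[1.0, 0.1, 3.0],
              [1.0, 0.1, 3.0]])
W_p = prune_channels(W, keep_ratio=2 / 3)  # drop the weakest of 3 channels
```

Because entire columns are zeroed, the sparsity is structured: the pruned channels can later be removed outright, shrinking the matrix and delivering real inference speedups rather than unstructured zeros.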