Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

📅 2025-10-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies two critical anomalies in large language models (LLMs) applied to audio-visual speech recognition (AVSR): (1) attention sinks, an abnormal concentration of attention on the beginning-of-sequence (BOS) token and on semantically low-content intermediate tokens, and (2) massive activations, anomalously large feature values at those sink tokens. It presents the first systematic analysis and empirical validation of attention concentration on intermediate tokens in multimodal speech recognition. To address these issues, the authors propose a decorrelation loss that explicitly penalizes cosine similarity between the BOS token and all other tokens, thereby mitigating attention drift and activation imbalance. The method combines analysis of LLM attention patterns, MLP-layer activation statistics, and token-level similarity modeling. Experiments demonstrate that the proposed approach significantly reduces word error rate (WER) under high audio-visual feature downsampling while preserving model stability at lower downsampling rates, enhancing the robustness and generalization of AVSR systems.
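The decorrelation loss described above amounts to penalizing how closely other tokens' hidden states align with the BOS token's hidden state. A minimal PyTorch sketch of that idea, where the function name, tensor shapes, and the choice of which layer's hidden states to use are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(hidden_states: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between the BOS token and all other tokens.

    hidden_states: (batch, seq_len, dim) hidden states from a chosen LLM layer.
    Adding this scalar to the training objective pushes non-BOS tokens away
    from the BOS direction, discouraging intermediate attention sinks.
    """
    bos = hidden_states[:, :1, :]                  # (batch, 1, dim)
    rest = hidden_states[:, 1:, :]                 # (batch, seq_len - 1, dim)
    cos = F.cosine_similarity(rest, bos, dim=-1)   # broadcast over sequence
    return cos.mean()
```

In practice this term would be weighted by a hyperparameter and added to the standard recognition loss; the weighting scheme is not specified here.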

📝 Abstract
Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
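The abstract's finding that massive activations correspond to fixed feature indices in the MLP layers suggests a simple diagnostic: flag (token, feature) positions whose magnitude dwarfs the layer's typical activation. A hedged sketch of such a check in PyTorch, where the threshold ratio and function name are assumptions for illustration, not the paper's procedure:

```python
import torch

def find_massive_activations(mlp_out: torch.Tensor, ratio: float = 100.0) -> torch.Tensor:
    """Locate outlier activations in one layer's MLP output.

    mlp_out: (seq_len, dim) activations for a single sequence.
    Returns an (n, 2) tensor of [token_index, feature_index] pairs whose
    magnitude exceeds `ratio` times the median magnitude. Repeated feature
    indices across sink tokens would match the fixed-index pattern reported.
    """
    mag = mlp_out.abs()
    thresh = ratio * mag.median()
    return (mag > thresh).nonzero(as_tuple=False)
```

Running this per layer over ASR, VSR, and AVSR inputs would let one check whether the flagged feature indices stay constant across sink tokens, as the paper reports.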
Problem

Research questions and friction points this paper is trying to address.

Identifies attention sinks in multimodal speech recognition models
Analyzes massive activations originating from MLP layers
Proposes decorrelation loss to mitigate these performance issues
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studied attention sinks in multimodal speech recognition models
Introduced decorrelation loss to reduce token similarity
Improved recognition accuracy under high feature downsampling