SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval

📅 2024-12-16
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Speech Large Language Models face substantial computational overhead and poor representational efficiency when processing long audio sequences (e.g., roughly 90 seconds). To address this, we propose SpeechPrune, a training-free, context-aware token pruning method tailored to the speech modality. SpeechPrune assesses and prunes speech tokens by semantic importance, jointly leveraging cross-modal speech-text similarity and approximated self-attention scores. On the SPIRAL benchmark, at a 20% pruning ratio, SpeechPrune improves accuracy by 29% over the original model and by up to 47% over random pruning, and performance remains nearly intact even at an aggressive 80% pruning ratio. These results indicate that SpeechPrune alleviates both the computational and the representational bottlenecks of long-context speech information retrieval (SIR) without any fine-tuning or additional training.
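The mechanism described above boils down to scoring each speech token and dropping the lowest-scoring ones before they reach the LLM. The minimal PyTorch sketch below illustrates that idea under stated assumptions: the function name prune_speech_tokens, the fusion weight alpha, and the use of max cosine similarity against the prompt tokens are illustrative choices rather than the authors' implementation, and attn_proxy merely stands in for the paper's approximated self-attention scores.

```python
# Minimal sketch of context-aware speech token pruning (not the authors' code).
# Assumes speech and text tokens are already embedded in a shared space.
import torch

def prune_speech_tokens(speech_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        attn_proxy: torch.Tensor,
                        prune_ratio: float = 0.2,
                        alpha: float = 0.5) -> torch.Tensor:
    """Return indices of speech tokens to KEEP.

    speech_emb : (S, d) speech token embeddings
    text_emb   : (T, d) text/prompt token embeddings
    attn_proxy : (S,)   approximated attention score per speech token
    """
    # Cross-modal cue: each speech token's max cosine similarity to any text token.
    s = torch.nn.functional.normalize(speech_emb, dim=-1)
    t = torch.nn.functional.normalize(text_emb, dim=-1)
    sim = (s @ t.T).max(dim=-1).values                # (S,)

    # Fuse the two cues into a single importance score (alpha is a hypothetical weight).
    score = alpha * sim + (1.0 - alpha) * attn_proxy

    # Keep the top (1 - prune_ratio) fraction of tokens, preserving temporal order.
    n_keep = max(1, int(round(speech_emb.size(0) * (1.0 - prune_ratio))))
    keep = torch.topk(score, n_keep).indices.sort().values
    return keep

# Toy usage: 1,000 speech tokens, 32 prompt tokens, 20% pruning.
S, T, d = 1000, 32, 256
keep_idx = prune_speech_tokens(torch.randn(S, d), torch.randn(T, d),
                               torch.rand(S), prune_ratio=0.2)
print(keep_idx.shape)  # ~800 retained token indices
```

With prune_ratio=0.8, the same call would keep only about 200 of the 1,000 speech tokens, which is the aggressive setting the summary reports as nearly lossless.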

📝 Abstract
We introduce Speech Information Retrieval (SIR), a new long-context task for Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample benchmark testing models' ability to extract critical details from approximately 90-second spoken inputs. While current Speech LLMs excel at short-form tasks, they struggle with the computational and representational demands of longer audio sequences. To address this limitation, we propose SpeechPrune, a training-free token pruning strategy that uses speech-text similarity and approximated attention scores to efficiently discard irrelevant tokens. In SPIRAL, SpeechPrune achieves accuracy improvements of 29% and up to 47% over the original model and the random pruning model at a pruning rate of 20%, respectively. SpeechPrune can maintain network performance even at a pruning level of 80%. This approach highlights the potential of token-level pruning for efficient and scalable long-form speech understanding.
Problem

Research questions and friction points this paper is trying to address.

Enhancing long-context speech retrieval for Speech LLMs
Reducing computational load in long audio sequences
Improving accuracy via token pruning without training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free token pruning strategy
Uses speech-text similarity scores
Approximates attention scores efficiently (see the sketch after this list)
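How the attention scores are approximated without running full self-attention over the long speech sequence is the efficiency-critical step. One plausible reading, sketched below purely as an assumption, is a single query-key pass: pool the text prompt into one query vector and score every speech key against it. The projection matrices W_q and W_k, the mean pooling, and the function name approx_attention_scores are hypothetical and not the paper's actual computation.

```python
# Hedged sketch of an "approximated attention score": one cheap query-key pass
# instead of full multi-layer self-attention. Not the paper's implementation.
import math
import torch

def approx_attention_scores(speech_hidden: torch.Tensor,
                            text_hidden: torch.Tensor,
                            W_q: torch.Tensor,
                            W_k: torch.Tensor) -> torch.Tensor:
    """One-shot attention proxy: softmax over speech tokens of q·k / sqrt(d)."""
    q = (text_hidden @ W_q).mean(dim=0)        # (d,) pooled prompt query
    k = speech_hidden @ W_k                    # (S, d) speech keys
    logits = k @ q / math.sqrt(q.numel())      # (S,) scaled dot products
    return torch.softmax(logits, dim=-1)       # per-token importance, sums to 1

# Toy usage with random projections: 1,000 speech tokens, 32 prompt tokens.
d = 256
scores = approx_attention_scores(torch.randn(1000, d), torch.randn(32, d),
                                 torch.randn(d, d), torch.randn(d, d))
print(scores.sum())  # ≈ 1.0
```

A per-token distribution like this can then be fused with the speech-text similarity cue, as in the earlier sketch, to decide which tokens to drop.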
👥 Authors
Yueqian Lin
PhD Student, Duke University
Yuzhe Fu
Duke University
Algorithm-hardware co-design
Jingyang Zhang
Duke University
Yudong Liu
Duke University
Jianyi Zhang
Research Scientist @ Google DeepMind, PI @ Duke University
LLMs · Generative AI · Trustworthy AI
Jingwei Sun
Duke University
Hai “Helen” Li
Duke University
Yiran Chen
Duke University