SGPA: Spectrogram-Guided Phonetic Alignment for Feasible Shapley Value Explanations in Multimodal Large Language Models

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of applying Shapley values directly to raw audio frames when interpreting end-to-end audio language models: computational infeasibility, a lack of semantic independence at the frame level, and masking artifacts. To overcome these limitations, the authors propose SGPA, a four-stage pipeline that integrates CTC forced alignment with spectral boundary refinement to generate acoustically stable, word-level aligned audio segments. This approach dramatically reduces the cost of Shapley value estimation, cutting the number of model evaluations by 43× on LFM2-Audio-1.5B with VoiceBench, while significantly altering attribution concentration (confirmed by statistical testing) without compromising the global cumulative attribution profile.

📝 Abstract
Explaining the behavior of end-to-end audio language models via Shapley value attribution is intractable under native tokenization: a typical utterance yields over $150$ encoder frames, inflating the coalition space by roughly $10^{42}$ relative to text; individual audio frames lack standalone meaning; and token boundaries that bisect phonetic transitions introduce masking artifacts. We introduce Spectrogram-Guided Phonetic Alignment (SGPA), a four-stage pipeline that combines Connectionist Temporal Classification forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments. Controlled diagnostics on LFM2-Audio-1.5B with VoiceBench show that SGPA yields a 43$\times$ reduction in model evaluations. Statistical testing confirms that SGPA significantly alters attribution concentration while preserving the global cumulative profile, establishing it as a feasibility-enabling layer for audio explainability.
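The paper itself does not include code, but the feasibility argument in the abstract can be sketched: exact Shapley computation requires evaluating the model on every coalition of inputs, which is hopeless for 150+ encoder frames ($2^{150}$ coalitions) yet tractable once the utterance is reduced to a handful of word-aligned segments. The sketch below is a minimal, generic exact Shapley computation over segment indices; `value_fn` and the per-segment `weights` are hypothetical stand-ins for a model probe on masked audio, not the authors' pipeline.

```python
import itertools
import math

def exact_shapley(value_fn, n_segments):
    """Exact Shapley values over n_segments players.

    value_fn maps a frozenset of segment indices (the coalition) to a
    scalar model score. With word-level segments, n_segments is small
    enough that enumerating all 2^n coalitions is feasible, unlike the
    frame-level case the abstract describes.
    """
    phi = [0.0] * n_segments
    players = list(range(n_segments))
    for i in players:
        others = [p for p in players if p != i]
        for r in range(n_segments):
            for subset in itertools.combinations(others, r):
                s = frozenset(subset)
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                w = (math.factorial(r) * math.factorial(n_segments - r - 1)
                     / math.factorial(n_segments))
                phi[i] += w * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical additive value function: each segment contributes a fixed
# score, standing in for the model's response to unmasked segments.
weights = [0.5, 0.2, 0.3]
value = lambda coalition: sum(weights[i] for i in coalition)

# For an additive game, Shapley values equal the per-segment weights.
print(exact_shapley(value, 3))
```

The practical point mirrors the abstract: shrinking the player set from frames to word segments is what makes attribution computable at all; sampling-based estimators reduce the constant further but cannot rescue a $2^{150}$ coalition space.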
Problem

Research questions and friction points this paper is trying to address.

Shapley value
audio explainability
multimodal large language models
phonetic alignment
spectrogram
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shapley value
phonetic alignment
spectrogram-guided
audio explainability
multimodal LLM
Paweł Pozorski
Warsaw University of Technology, Warsaw, Poland
Jakub Muszyński
Warsaw University of Technology, Warsaw, Poland
Maria Ganzha
Associate Professor Warsaw University of Technology
Agent-based computing, multiagent systems, distributed systems, ontology, semantic data processing