🤖 AI Summary
This work addresses the challenges of applying Shapley values directly to raw audio frames when interpreting end-to-end audio language models: computational infeasibility, a lack of semantic independence at the frame level, and masking artifacts. To overcome these limitations, the authors propose SGPA, a four-stage pipeline that integrates CTC forced alignment with spectral boundary refinement to generate acoustically stable, word-aligned audio segments. This approach dramatically reduces the cost of Shapley value estimation, cutting the number of model evaluations by 43× on LFM2-Audio-1.5B evaluated on VoiceBench, while statistical testing shows that it significantly sharpens attribution concentration without compromising the global cumulative attribution profile.
📝 Abstract
Explaining the behavior of end-to-end audio language models via Shapley value attribution is intractable under native tokenization: a typical utterance yields over $150$ encoder frames, inflating the coalition space by roughly $10^{42}$ relative to text; individual audio frames lack standalone meaning; and token boundaries that bisect phonetic transitions introduce masking artifacts. We introduce Spectrogram-Guided Phonetic Alignment (SGPA), a four-stage pipeline that combines Connectionist Temporal Classification forced alignment with spectral boundary refinement to produce acoustically stable, word-aligned audio segments. Controlled diagnostics on LFM2-Audio-1.5B with VoiceBench show that SGPA yields a 43$\times$ reduction in model evaluations. Statistical testing confirms that SGPA significantly alters attribution concentration while preserving the global cumulative profile, establishing it as a feasibility-enabling layer for audio explainability.
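The scale of the intractability claim can be reproduced with simple arithmetic. The sketch below is illustrative, not from the paper: the ~150-frame figure is quoted in the abstract, while the text-token count, permutation count, and segment count are assumptions chosen for illustration.

```python
# Illustrative sketch of the coalition-space blow-up and the effect of
# segment-level grouping on permutation-sampling Shapley estimation.
# Only the ~150-frame figure comes from the abstract; the other numbers
# are assumptions for illustration.

n_frames = 150        # encoder frames for a typical utterance (from the abstract)
n_text_tokens = 10    # assumed token count for a comparable text input

# Exact Shapley values require the model's output on every coalition
# (subset) of players, i.e. 2^n evaluations.
blowup = 2 ** n_frames / 2 ** n_text_tokens
print(f"coalition-space blow-up vs. text: ~{blowup:.1e}")  # ~1.4e+42

# Permutation-sampling Shapley estimation instead costs O(m * n) model
# evaluations for m sampled permutations over n players, so grouping
# frames into word-aligned segments shrinks the cost linearly in n.
m = 200               # assumed number of sampled permutations
n_segments = 4        # hypothetical word-level segment count
reduction = (m * n_frames) / (m * n_segments)
print(f"evaluation reduction from segmenting: {reduction:.1f}x")
```

With roughly 150 frames collapsed into a handful of word-level segments, a reduction on the order of the reported 43× follows directly from the linear dependence of the evaluation budget on the number of players.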