AI Summary
This work addresses the challenge that existing large audio language models (LALMs) struggle to correct factually incorrect knowledge encoded within their parameters, as conventional text-based editing methods are ill-suited for handling continuous speech representations and cross-modal knowledge localization. To bridge this gap, the study introduces the first knowledge-editing benchmark tailored for LALMs and proposes a speech-driven "locate-and-edit" framework. This framework leverages voice-aware causal tracing to precisely identify the critical layers and positions within acoustic, linguistic, and cross-modal modules that support factual claims, enabling targeted model edits. Experimental results demonstrate that the proposed approach substantially outperforms both text-based editing and full-model fine-tuning, achieving higher accuracy in knowledge updating while enabling fine-grained control. Notably, it offers the first empirical insight into the joint encoding mechanisms of factual knowledge across multimodal components in LALMs.
Abstract
Large Audio-Language Models (LALMs) have shown strong performance in speech understanding, making speech a natural interface for accessing factual information. Yet they are trained on static corpora and may encode incorrect facts. Existing model editing methods localize and update facts in text-only LLMs, but do not account for continuous speech representations or for how knowledge is distributed across acoustic, language, and cross-modal modules. We construct the first audio benchmark for knowledge localization and editing in LALMs and propose a speech-driven locate-then-edit framework. We first use speech-aware causal tracing to localize the layers and modules that support factual retrieval, then apply edits at the identified sites. Experiments show that factual knowledge is jointly encoded in audio and text modules, and that audio editing yields more effective updates than text editing or fine-tuning, enabling fine-grained knowledge control in speech AI systems.
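The "locate" step of a locate-then-edit pipeline can be illustrated with a minimal causal-tracing sketch. This is a toy: the linear-layer "model", the corruption scheme, and all variable names are illustrative assumptions, not the paper's LALM or its actual tracing procedure. The idea it demonstrates is the standard one: corrupt the input, restore the clean hidden state at one layer at a time, and rank layers by how strongly restoration moves the output back toward the clean prediction.

```python
import numpy as np

# Toy causal tracing over a stack of random tanh layers (hypothetical model,
# NOT the paper's LALM). We corrupt the input, then restore the clean hidden
# state after each layer in turn and measure the indirect effect on a scalar
# readout. High-effect layers are candidate sites for a targeted edit.

rng = np.random.default_rng(0)
n_layers, dim = 6, 8
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]
readout = rng.standard_normal(dim)

def forward(x, restore_at=None, restore_states=None):
    """Run the stack; optionally overwrite the hidden state after one layer."""
    h = x
    states = []
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)
        if restore_at == i and restore_states is not None:
            h = restore_states[i]  # patch in the clean activation
        states.append(h)
    return readout @ h, states

clean_x = rng.standard_normal(dim)             # stands in for a clean speech input
noisy_x = clean_x + rng.standard_normal(dim)   # corrupted ("noised") input

clean_score, clean_states = forward(clean_x)
noisy_score, _ = forward(noisy_x)

# Indirect effect of restoring each layer's clean state in the corrupted run.
effects = [abs(forward(noisy_x, restore_at=i, restore_states=clean_states)[0]
               - noisy_score)
           for i in range(n_layers)]
best_layer = int(np.argmax(effects))
print(best_layer)  # layer whose restored clean state most moves the output
```

In the real method the same logic would operate over acoustic, language, and cross-modal modules of an LALM, with noise applied to the speech input and the effect measured on the probability of the factual answer token.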