An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address shallow audio–text modality fusion in multimodal large language models (MLLMs), which limits semantic reasoning, this paper proposes an interleaved instruction-tuning framework that embeds audio tokens directly within textual prompt sequences to encourage fine-grained cross-modal alignment. The authors introduce SHARD, the first benchmark designed specifically for audio semantic reasoning, covering synonym and hypernym identification tasks. Experiments on the LTU model combine zero-shot interleaved prompting, audio tokenization, and multimodal in-context learning. Results show a substantial gain in semantic reasoning performance (+12.3%) alongside a modest drop in audio classification accuracy (−3.1%), revealing a trade-off between deeper modality fusion and preservation of the model's original labeling ability. Core contributions: (1) the interleaved fine-tuning paradigm; (2) the SHARD benchmark; and (3) a systematic characterization of the reasoning–classification trade-off.
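The interleaving idea can be illustrated with a toy sketch: instead of prepending all audio tokens before the text, audio tokens are spliced into the prompt wherever an audio placeholder appears. The token IDs and the sentinel value below are illustrative assumptions, not the LTU implementation:

```python
AUDIO_SLOT = -1  # hypothetical sentinel marking where audio belongs in the prompt

def interleave(prompt_tokens, audio_tokens, slot=AUDIO_SLOT):
    """Splice audio tokens into the text sequence at each `slot` position,
    rather than concatenating them as a prefix before the whole prompt."""
    out = []
    for tok in prompt_tokens:
        if tok == slot:
            out.extend(audio_tokens)  # audio lands mid-prompt
        else:
            out.append(tok)
    return out

# Illustrative prompt, e.g. "Is <audio> a hypernym of 'animal sound'?"
text = [101, 7, 8, AUDIO_SLOT, 9, 102]
audio = [900, 901, 902]  # tokens produced by the audio encoder

print(interleave(text, audio))
# → [101, 7, 8, 900, 901, 902, 9, 102]
```

The contrast with standard concatenation is that the surrounding text tokens directly condition on (and follow) the audio tokens, which is the fine-grained alignment the paper's tuning framework targets.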

📝 Abstract
Standard training for Multi-modal Large Language Models (MLLMs) concatenates non-textual information, such as vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model's ability to leverage the core language model's reasoning capabilities. This work examines the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct experiments on the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created benchmark for audio-based semantic reasoning focused on synonym and hypernym recognition. Our findings show that even zero-shot interleaved prompting improves performance on our reasoning tasks, and that a small amount of fine-tuning with interleaved training prompts improves results further, albeit at the expense of the MLLM's audio labeling ability.
Problem

Research questions and friction points this paper is trying to address.

Evaluating interleaved instruction tuning for audio MLLM semantic reasoning performance
Addressing limited modality integration in standard MLLM training approaches
Assessing trade-offs between reasoning improvement and audio labeling degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved audio tokens within text prompts
Fine-tuning with interleaved training prompts
Synonym and hypernym audio reasoning benchmark