Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the susceptibility of multimodal large language models to interference during task switching—a phenomenon that has lacked systematic evaluation. The authors propose the first benchmark framework specifically designed to assess such task interference, comprising six tasks spanning textual and visual modalities. By deliberately introducing mismatches between historical and target tasks along three dimensions—modality, reasoning type, and answer format—the work systematically analyzes their impact on model performance. Experiments reveal that interference is directional: performance drops significantly when switching from text-only to image-based tasks, whereas the reverse transition exhibits minimal degradation. Modality mismatch is identified as the primary contributor, followed by answer format, with reasoning type showing the weakest effect. Moreover, concurrent mismatches across multiple dimensions compound performance decline, offering new empirical insights and a foundation for mitigating multimodal task interference.

📝 Abstract
Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of the history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.
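The three mismatch axes can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the `Task` fields, the example task names and attribute values, and the definition of interference as a simple accuracy difference are hypothetical and are not taken from the paper, which does not specify its tasks or metric in this summary.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Task:
    """A task described along the three axes named in the abstract (hypothetical schema)."""
    name: str
    modality: str        # e.g. "text" or "image"
    reasoning: str       # e.g. "factual" or "arithmetic" (illustrative labels)
    answer_format: str   # e.g. "multiple_choice" or "free_form" (illustrative labels)


AXES = ("modality", "reasoning", "answer_format")


def mismatch_axes(history: Task, target: Task) -> set[str]:
    """Return the axes on which the history task differs from the target task."""
    return {axis for axis in AXES if getattr(history, axis) != getattr(target, axis)}


def interference(control_acc: float, switched_acc: float) -> float:
    """Assumed metric: accuracy with matching history minus accuracy after a task switch."""
    return control_acc - switched_acc


# Hypothetical tasks; the benchmark's actual six tasks are not listed in this summary.
tasks = [
    Task("text_mc", "text", "factual", "multiple_choice"),
    Task("image_mc", "image", "factual", "multiple_choice"),
    Task("image_open", "image", "arithmetic", "free_form"),
]

# Ordered pairs matter because the reported interference is directional.
for history, target in permutations(tasks, 2):
    print(f"{history.name} -> {target.name}: {sorted(mismatch_axes(history, target))}")
```

Enumerating ordered pairs rather than unordered ones keeps the text-to-image and image-to-text directions separate, which is the distinction the abstract highlights.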
Problem

Research questions and friction points this paper is trying to address.

multimodal task interference
history-target mismatch
multimodal LLMs
task switching
performance degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal task interference
history-target mismatch
modality mismatch
multimodal LLMs
task switching
Authors
Masayuki Kawarada
Artificial Intelligence Research Center, AIST
Tatsuya Ishigaki
National Institute of Advanced Industrial Science and Technology (AIST)
Natural Language Processing, Text Generation, Text Summarization
Hiroya Takamura
Artificial Intelligence Research Center, AIST