Uncovering and Mitigating Transient Blindness in Multimodal Model Editing

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal model editing (MMED) methods rely on low-similarity or random inputs for evaluation, which often masks overfitting and conceals "transient visual blindness" in visual question answering (VQA): a phenomenon in which edited models depend excessively on the edited text while disregarding visual inputs. This work formally defines and characterizes transient visual blindness for the first time. We propose a comprehensive locality evaluation framework covering three distinct scenarios: random images, image-absent inputs, and consistent images. To enforce cross-modal balance, we design an adversarial loss mechanism, integrated with dynamic VQA evaluation and token-level analysis for fine-grained quantification of editing effects. Experiments demonstrate that our method improves locality by 17% on average, significantly mitigating transient visual blindness and outperforming state-of-the-art baselines across multiple benchmarks.

📝 Abstract
Multimodal Model Editing (MMED) aims to correct erroneous knowledge in multimodal models. Existing evaluation methods, adapted from textual model editing, overstate success by relying on low-similarity or random inputs, obscuring overfitting. We propose a comprehensive locality evaluation framework covering three key dimensions: random-image locality, no-image locality, and consistent-image locality, operationalized through seven distinct data types, enabling a detailed and structured analysis of multimodal edits. We introduce De-VQA, a dynamic evaluation for visual question answering, which uncovers a phenomenon we term transient blindness: overfitting to edit-similar text while ignoring visual inputs. Token-level analysis shows that edits disproportionately affect textual tokens. We propose locality-aware adversarial losses to balance cross-modal representations. Empirical results demonstrate that our approach consistently outperforms existing baselines, reducing transient blindness and improving locality by 17% on average.
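The three locality dimensions in the abstract correspond to three ways of probing an edited model with the same question. A minimal sketch of how such probes could be constructed (the class and function names here are illustrative assumptions, not the paper's code, and the seven fine-grained data types are not reproduced):

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VQAProbe:
    """One locality probe: a (scenario, image, question) triple."""
    scenario: str          # "random_image" | "no_image" | "consistent_image"
    image: Optional[str]   # image identifier, or None when the image is withheld
    question: str


def build_locality_probes(edit_question: str, edit_image: str,
                          image_pool: List[str],
                          rng: random.Random = random.Random(0)) -> List[VQAProbe]:
    """Build one probe per locality scenario described in the abstract:
    - random_image:     same question paired with an unrelated image,
    - no_image:         same question with the image withheld,
    - consistent_image: same question with the image the edit concerned.
    An edit with good locality leaves the model's answers to these probes unchanged."""
    return [
        VQAProbe("random_image", rng.choice(image_pool), edit_question),
        VQAProbe("no_image", None, edit_question),
        VQAProbe("consistent_image", edit_image, edit_question),
    ]
```

Under transient blindness, an over-edited model would return the newly edited answer on all three probes, since it keys on the question text alone and ignores which image (if any) is present.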
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal model editing with a comprehensive locality framework
Addressing transient blindness, where edited models ignore visual inputs
Balancing cross-modal representations via locality-aware adversarial losses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive locality evaluation framework for multimodal edits
Dynamic VQA evaluation uncovering transient blindness phenomenon
Locality-aware adversarial losses balancing cross-modal representations
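The locality-aware loss idea can be sketched as an editing objective plus penalties that keep the model's output distributions unchanged on the locality probes. This is a toy illustration under assumed names and weighting, not the paper's actual objective:

```python
import math
from typing import Dict, List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def locality_aware_loss(edit_nll: float,
                        pre_edit_dists: Dict[str, List[float]],
                        post_edit_dists: Dict[str, List[float]],
                        lam: float = 1.0) -> float:
    """Combine the editing objective with locality penalties.

    edit_nll:        negative log-likelihood of the desired edited answer
    pre_edit_dists:  scenario name -> answer distribution before the edit
    post_edit_dists: scenario name -> answer distribution after the edit
    lam:             weight trading off edit success against locality

    The penalty is zero exactly when the edit leaves every probe
    distribution untouched, so minimizing it discourages the model from
    answering locality probes with the edited text.
    """
    locality_penalty = sum(
        kl_divergence(pre_edit_dists[s], post_edit_dists[s])
        for s in pre_edit_dists
    )
    return edit_nll + lam * locality_penalty
```

For example, if the edit does not move the distribution on a random-image probe, the penalty term vanishes and only the edit loss remains.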
Authors
Xiaoqi Han (School of Computer and Information Technology, Shanxi University, China)
Ru Li (Harbin Institute of Technology)
Ran Yi (Associate Professor, Shanghai Jiao Tong University; Computer Vision, Computer Graphics)
Hongye Tan (School of Computer and Information Technology, Shanxi University, China)
Zhuomin Liang (School of Computer and Information Technology, Shanxi University, China)
Víctor Gutiérrez-Basulto (School of Computer Science and Informatics, Cardiff University, UK)
Jeff Z. Pan (Professor of Knowledge Computing, University of Edinburgh; Artificial Intelligence, Knowledge Representation and Reasoning, Knowledge Based Learning)