VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) lack robustness under real-world data corruption, their behavior under such corruption is poorly understood, and targeted data augmentation strategies are missing. Method: This paper introduces VisMoDAl, the first visual analytics framework designed for VLM corruption robustness. It integrates multi-view visualization, performance degradation attribution, fine-grained data slicing, and interactive exploration to analyze corruption effects at the task, sample, and feature levels. Contribution/Results: As the first work to apply visual analytics to VLM robustness research, it enables joint optimization of model diagnosis and data augmentation. Experiments on image captioning show that the framework substantially improves understanding of model vulnerability patterns and produces interpretable, deployable augmentation strategies, yielding an average robustness improvement of 12.7%.

📝 Abstract
Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, challenges remain due to a lack of in-depth comprehension of model behavior, as well as the expertise and iterative effort needed to explore data patterns. Given the success of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruptions on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework designed to evaluate VL model robustness against various corruption types and identify underperforming samples to guide the development of effective DA strategies. Grounded in a literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and the corresponding data slices. Unlike conventional approaches, VisMoDAl enables users to reason about the effects of corruption on VL models, facilitating both model behavior understanding and DA strategy formulation. The utility of our system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.
Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language model robustness against data corruption
Identifying underperforming samples to guide data augmentation strategies
Understanding model behavior under various corruption types
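Robustness evaluation of the kind described above is commonly quantified as the relative drop of a task metric when inputs are corrupted. A minimal sketch of that idea, using NumPy with illustrative (hypothetical) noise levels and helper names, not the paper's actual code:

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 3) -> np.ndarray:
    """Apply Gaussian-noise corruption to an image in [0, 255].

    The sigma values below are illustrative severity levels, not the
    exact parameters of any published corruption benchmark.
    """
    sigma = [0.04, 0.06, 0.08, 0.10, 0.14][severity - 1]
    noisy = image / 255.0 + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0) * 255.0

def relative_drop(clean_score: float, corrupted_score: float) -> float:
    """Robustness as the relative degradation of a task metric
    (e.g., CIDEr for image captioning) under corruption."""
    return (clean_score - corrupted_score) / clean_score

# Example: a captioning model scoring 1.10 CIDEr on clean images
# and 0.85 on noise-corrupted images.
drop = relative_drop(1.10, 0.85)
```

A smaller `drop` means a more robust model; averaging it across corruption types and severities gives a single robustness summary per model.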
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual analytics framework for corruption robustness
Multi-level analysis of model behavior
Guiding data augmentation strategies through visualization
Huanchen Wang
City University of Hong Kong; Southern University of Science and Technology
HCI, Visualization, Human-AI Collaboration, Intangible Cultural Heritage, Generative AI

Wencheng Zhang
Southern University of Science and Technology, China

Zhiqiang Wang
Southern University of Science and Technology, China

Zhicong Lu
Assistant Professor, George Mason University
HCI, social computing, live streaming, creativity support, intangible cultural heritage

Yuxin Ma
Southern University of Science and Technology, China