Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates how multimodal large language models (MLLMs) depend on visual input during chain-of-thought (CoT) reasoning, moving beyond answer-accuracy evaluation to assess the faithfulness and robustness of the reasoning itself. To this end, it proposes Contrastive Region Masking (CRM), a training-free, step-level, causal visual attribution method. CRM systematically masks annotated image regions and contrasts the resulting reasoning traces with the unmasked baseline to isolate how each reasoning step depends on specific visual regions. Unlike attention-map visualization or evaluation of final answers alone, CRM diagnoses visual grounding throughout the reasoning process. It reveals two distinct failure modes: some models preserve their reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. Applied to datasets such as VisArgs, CRM reframes visual benchmarks as diagnostic tools that measure robustness and fidelity of reasoning in addition to performance.

📝 Abstract
We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting the evaluation from correctness of answers to faithfulness of reasoning, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just performance, but also robustness and fidelity of reasoning.
Problem

Research questions and friction points this paper is trying to address.

CRM reveals MLLMs' visual region dependencies during reasoning
It contrasts masked and unmasked reasoning traces for causal attribution
CRM shifts evaluation from answer correctness to reasoning faithfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Region Masking reveals MLLM visual dependencies
CRM provides causal step-level attribution by masking regions
Shifts evaluation from answer correctness to reasoning faithfulness
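
The mask-and-contrast idea summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the box format, the constant-fill masking, the word-overlap (Jaccard) score, and the names `mask_region`, `step_similarity`, `contrast_traces`, and `toy_model` are all assumptions made for illustration, and the toy "model" is a plain function standing in for an MLLM. A real pipeline would likely operate on image arrays and use a stronger trace-comparison metric.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]           # grayscale pixels as nested lists (assumption)
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), hypothetical annotation format


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of `image` with every pixel inside `box` set to `fill`."""
    x0, y0, x1, y1 = box
    masked = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            masked[y][x] = fill
    return masked


def step_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets; a stand-in for a real metric."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def contrast_traces(
    image: Image,
    regions: Dict[str, Box],
    reason: Callable[[Image], List[str]],
) -> Dict[str, List[float]]:
    """Score how much each baseline reasoning step changes when a region is masked.

    Returns region name -> list of (1 - similarity), one entry per baseline step.
    A baseline step with no counterpart in the masked trace counts as 1.0.
    """
    baseline = reason(image)
    shifts: Dict[str, List[float]] = {}
    for name, box in regions.items():
        masked_trace = reason(mask_region(image, box))
        shifts[name] = [
            1.0 - step_similarity(step, masked_trace[i]) if i < len(masked_trace) else 1.0
            for i, step in enumerate(baseline)
        ]
    return shifts


def toy_model(img: Image) -> List[str]:
    """Hypothetical stand-in for an MLLM: its second step depends on the top-left pixels."""
    logo_visible = any(img[y][x] for y in range(2) for x in range(2))
    return [
        "the image shows a poster",
        "the logo is visible" if logo_visible else "no logo can be seen",
    ]


image = [[1] * 4 for _ in range(4)]
shifts = contrast_traces(
    image,
    {"logo": (0, 0, 2, 2), "corner": (2, 2, 4, 4)},
    toy_model,
)
```

In this toy setup, masking the `corner` region leaves both steps unchanged, while masking the `logo` region changes only the second step, which is the kind of step-level dependence on a region that the contrast is meant to surface.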