Peek-a-Boo Reasoning: Contrastive Region Masking in MLLMs

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how multimodal large language models (MLLMs) depend on visual input during chain-of-thought (CoT) reasoning. Rather than evaluating answer accuracy alone, it assesses the faithfulness and robustness of the reasoning itself. To this end, we propose Contrastive Region Masking (CRM), a training-free, step-level, causal visual attribution method. CRM masks annotated image regions one at a time and contrasts the resulting reasoning trace with the unmasked baseline, isolating which reasoning steps depend on which visual regions. Unlike post-hoc attention visualization or evaluation based only on the final output, CRM diagnoses visual grounding at each step of the reasoning process. It reveals two distinct failure modes: some models preserve their reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbation. Applied to datasets such as VisArgs, CRM shifts multimodal reasoning evaluation toward two further dimensions: robustness and fidelity.
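The region-masking step described above can be illustrated with a small sketch. It is not the paper's implementation: the image representation (a list of pixel rows), the box convention `(x0, y0, x1, y1)` with exclusive upper bounds, the constant fill value, and the names `mask_region`, `masked_variants`, and the example regions are all assumptions made for illustration.

```python
from copy import deepcopy


def mask_region(image, box, fill=0):
    """Return a copy of `image` with the pixels inside `box` set to `fill`.

    `image` is a list of rows (each a list of pixel values) and `box` is
    (x0, y0, x1, y1) with exclusive upper bounds. Both conventions are
    assumptions of this sketch, not taken from the paper.
    """
    x0, y0, x1, y1 = box
    masked = deepcopy(image)  # never modify the unmasked baseline
    for y in range(max(y0, 0), min(y1, len(masked))):
        row = masked[y]
        for x in range(max(x0, 0), min(x1, len(row))):
            row[x] = fill
    return masked


def masked_variants(image, regions, fill=0):
    """Build one masked copy of the image per annotated region."""
    return {name: mask_region(image, box, fill) for name, box in regions.items()}


# Toy 4x3 "image" and two hypothetical region annotations.
image = [[1] * 4 for _ in range(3)]
regions = {"logo": (0, 0, 2, 2), "text": (2, 1, 4, 3)}
variants = masked_variants(image, regions)
```

Each entry in `variants` would then be passed to the model in place of the original image, giving one masked reasoning trace per annotated region to compare against the unmasked baseline.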

📝 Abstract

We introduce Contrastive Region Masking (CRM), a training-free diagnostic that reveals how multimodal large language models (MLLMs) depend on specific visual regions at each step of chain-of-thought (CoT) reasoning. Unlike prior approaches limited to final answers or attention maps, CRM provides causal, step-level attribution by systematically masking annotated regions and contrasting the resulting reasoning traces with unmasked baselines. Applied to datasets such as VisArgs, CRM reveals distinct failure modes: some models preserve reasoning structure but hallucinate when evidence is missing, while others ground tightly to visual cues yet collapse under perturbations. By shifting the evaluation from correctness of answers to faithfulness of reasoning, CRM reframes visual benchmarks as diagnostic tools, highlighting the need for multimodal evaluation frameworks that measure not just performance, but also robustness and fidelity of reasoning.
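The contrast between masked and unmasked reasoning traces can be sketched as a step-by-step comparison. The abstract does not say how traces are compared, so everything here is an assumption: the token-overlap similarity from `difflib`, the `threshold` of 0.6, the function names, and the example traces, which are invented for illustration rather than real model outputs.

```python
from difflib import SequenceMatcher


def step_similarity(a, b):
    """Token-level similarity in [0, 1] between two reasoning steps.

    Uses difflib as a stand-in metric; the paper's actual comparison
    method is not specified in the abstract.
    """
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def contrast_traces(baseline_steps, masked_steps, threshold=0.6):
    """Compare a masked trace to the unmasked baseline, step by step.

    A step is flagged as `changed` when its similarity to the baseline
    step falls below `threshold` (a hypothetical cutoff).
    """
    report = []
    for i, base in enumerate(baseline_steps):
        # A masked trace that ends early is treated as an empty step.
        masked = masked_steps[i] if i < len(masked_steps) else ""
        sim = step_similarity(base, masked)
        report.append({"step": i, "similarity": sim, "changed": sim < threshold})
    return report


# Invented traces: the second step relied on the masked region.
baseline = [
    "The poster shows a cracked globe.",
    "The crack suggests environmental damage.",
    "So the ad argues for climate action.",
]
masked = [
    "The poster shows a cracked globe.",
    "The image shows a plain blue background.",
    "So the ad argues for climate action.",
]
report = contrast_traces(baseline, masked)
changed_steps = [r["step"] for r in report if r["changed"]]
```

In this toy case only the second step is flagged. A later step that stays unchanged even though the evidence it relied on was masked is the kind of pattern worth inspecting by hand, since it may indicate reasoning that is not grounded in the image.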

Problem

Research questions and friction points this paper is trying to address.

CRM reveals MLLMs' visual region dependencies during reasoning

It contrasts masked and unmasked reasoning traces for causal attribution

CRM shifts evaluation from answer correctness to reasoning faithfulness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Region Masking reveals MLLM visual dependencies

CRM provides causal step-level attribution by masking regions

Shifts evaluation from answer correctness to reasoning faithfulness

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs