From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

📅 2026-06-10

🤖 AI Summary

Existing vision-language models are brittle at multi-image causal reasoning: they struggle with interventional and counterfactual queries, and most prior approaches inject causal knowledge through external textual prompts. This work proposes BridgeVLM, which internalizes causal mechanisms directly into the model's execution pipeline. It induces a causal graph from multi-image inputs and encodes it as structured Causal Tokens, injects RAMP layers into the LLM decoder to propagate causal information, and introduces a unified training interface, M3S, for causal supervision at both local and global granularity. On CausalVLBench intervention tasks, BridgeVLM reaches 54.4% accuracy, 21.2 percentage points above prompt-level supervision (33.2%). It raises accuracy on Causal3D from 43.6% to 49.0%, and lifts the F1 score for causal structure learning on CausalVLBench from 33.4% to 75.1% (+41.7 points).
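To make the structure-learning metric above concrete, the following is a minimal sketch of one common way to score a predicted causal graph: F1 over directed edges. The function name `edge_f1`, the variable names, and the toy graphs are illustrative assumptions; the exact scoring protocol used by CausalVLBench is not stated here and may differ.

```python
def edge_f1(predicted, reference):
    """F1 over directed edges; each edge is a (cause, effect) tuple.

    Assumption: an edge counts as correct only if both its endpoints
    and its direction match the reference graph.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy graphs (hypothetical variables, not taken from any benchmark).
reference_edges = [("switch", "light"), ("light", "shadow"), ("switch", "fan")]
predicted_edges = [("switch", "light"), ("shadow", "light"), ("switch", "fan")]

# The reversed edge ("shadow", "light") counts as both a false positive
# and a missed reference edge, so precision and recall are each 2/3.
score = edge_f1(predicted_edges, reference_edges)
```

Under this definition a reversed edge is penalized twice, which is why direction errors pull the score down quickly on small graphs.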

📝 Abstract

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision-language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens, which are executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface, M3S, for fine-grained causal supervision at multiple granularities (local and global). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves accuracy on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% $\rightarrow$ 75.1%).
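As a rough illustration of the idea of turning a causal graph into tokens and passing messages along its edges, here is a toy sketch in plain Python. It is not the BridgeVLM implementation: the abstract does not specify the Causal Token format or the internals of a RAMP layer, so the marker strings (`<cause>`, `<effect>`), the function names, and the mean-of-parents update rule are all assumptions made for illustration.

```python
def serialize_graph(edges):
    """Flatten directed (cause, effect) edges into a list of marker strings.

    The "<cause>"/"<effect>" markers are hypothetical placeholders,
    not the paper's actual Causal Token vocabulary.
    """
    tokens = []
    for cause, effect in edges:
        tokens += ["<cause>", cause, "<effect>", effect]
    return tokens


def propagate(states, edges):
    """One round of message passing restricted to cause -> effect edges.

    Assumed update rule: each node adds the mean of its parents' states
    from the previous round; nodes without parents are left unchanged.
    """
    parents = {node: [] for node in states}
    for cause, effect in edges:
        parents[effect].append(cause)

    updated = {}
    for node, vector in states.items():
        if not parents[node]:
            updated[node] = list(vector)
            continue
        count = len(parents[node])
        mean = [sum(states[p][k] for p in parents[node]) / count
                for k in range(len(vector))]
        updated[node] = [v + m for v, m in zip(vector, mean)]
    return updated


# Toy chain: switch -> light -> shadow (hypothetical variables).
edges = [("switch", "light"), ("light", "shadow")]
states = {"switch": [1.0, 0.0], "light": [0.0, 1.0], "shadow": [0.0, 0.0]}

tokens = serialize_graph(edges)
after_one_round = propagate(states, edges)
```

Because updates read only the previous round's states, information from `switch` reaches `light` after one round and would need a second round to reach `shadow`, which is the usual behavior of synchronous message passing on a chain.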

Problem

Research questions and friction points this paper is trying to address.

causal reasoning

vision-language models

multi-image inputs

interventional queries

counterfactual reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Tokens

RAMP layers

internalized causal reasoning