Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

Date: 2026-06-04
Citations: 0
Influential: 0

AI Summary
This work addresses the plausible but incorrect answers that current vision-language models (VLMs) often produce in causal physical reasoning. The authors introduce CausalPhys, a benchmark of over 3,000 image- and video-based questions, each annotated with an expert-labeled, fine-grained causal graph, spanning four task types: perception, anticipation, intervention, and goal-directed reasoning. The work brings causal graphs into VLM evaluation through a chain-of-thought metric based on causal graph alignment, and proposes a causality-aware fine-tuning method, Causal Rationale-informed Fine-Tuning (CRFT). Together these form a closed loop covering data, evaluation, and training. Experiments show that CRFT substantially improves both accuracy and interpretability across multiple VLM backbones, while the benchmark analysis reveals systematic gaps in how these models capture causal dependencies.
Abstract
Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.
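The abstract does not give the formula for the causal-graph-grounded metric, so the following is only a minimal sketch of one way such an alignment score could be computed. It assumes that causal relations have already been extracted from a model's chain-of-thought as (cause, effect) pairs and that the expert annotation is a set of directed edges. The function name `edge_alignment`, the edge-overlap F1 formulation, and the toy example are all illustrative assumptions, not the authors' definition.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect); hypothetical representation


def _normalize(edges: Iterable[Edge]) -> Set[Edge]:
    """Lowercase and strip node names so trivially different spellings match."""
    return {(c.strip().lower(), e.strip().lower()) for c, e in edges}


def edge_alignment(predicted: Iterable[Edge], reference: Iterable[Edge]) -> Dict[str, float]:
    """Score how well predicted causal edges overlap an annotated causal graph.

    This is an assumed edge-overlap precision/recall/F1, not the paper's metric.
    """
    pred = _normalize(predicted)
    ref = _normalize(reference)
    matched = pred & ref  # edges stated by the model that are in the annotation

    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (invented for illustration): a ball is dropped onto a cup.
reference_graph = [
    ("ball released", "ball falls"),
    ("ball falls", "ball hits cup"),
    ("ball hits cup", "cup tips over"),
]
# Edges hypothetically extracted from a model's chain-of-thought; the model
# skips the collision and links the fall directly to the cup tipping over.
model_edges = [
    ("Ball released", "ball falls"),
    ("ball falls", "cup tips over"),
]

scores = edge_alignment(model_edges, reference_graph)
print(scores)
```

In this toy case the model reaches the right outcome but omits an intermediate causal step, so the edge-level score is low even though an answer-only metric could mark the response as correct. That is the kind of gap a reasoning-level metric is meant to expose.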
Problem

Research questions and friction points this paper is trying to address.

causal reasoning
physical understanding
vision-language models
causal dependencies
physical world reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning
vision-language models
causal graph
physical understanding
fine-tuning