Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of current vision-language models (VLMs) to give plausible but incorrect answers in causal physical reasoning. The authors introduce CausalPhys, a benchmark of over 3,000 image- and video-based questions, each annotated with an expert-labeled, fine-grained causal graph, spanning four task types: perception, anticipation, intervention, and goal orientation. Causal graphs are integrated into VLM evaluation through a metric that scores how well a model's chain-of-thought reasoning aligns with the annotated causal graph, and through a fine-tuning method, Causal Rationale-informed Fine-Tuning (CRFT), which together unify data, evaluation, and training. Experiments show that CRFT substantially improves both the accuracy and the interpretability of causal reasoning across multiple VLM backbones, while the evaluation reveals systematic gaps in how these models capture causal dependencies.

📝 Abstract

Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.
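The abstract does not spell out how the causal-graph-grounded metric is computed. As a rough illustration only, the sketch below shows one way that alignment between a model's reasoning and an annotated causal graph could be scored: edge-level precision, recall, and F1 over directed (cause, effect) pairs. Everything here is an assumption rather than the paper's definition: the function name `edge_alignment`, the `object.attribute` node naming, the example edges, and the premise that causal edges have already been extracted from the chain-of-thought by some separate step.

```python
# Hypothetical sketch: NOT the CausalPhys metric, only an edge-overlap
# illustration of "reasoning-to-graph alignment".

Edge = tuple[str, str]  # (cause, effect), both plain string node names


def edge_alignment(predicted: set[Edge], reference: set[Edge]) -> dict[str, float]:
    """Return edge-level precision, recall, and F1 for two sets of directed edges."""
    matched = predicted & reference  # edges the reasoning got exactly right
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Assumed expert annotation for an invented ramp-and-ball question.
reference_graph: set[Edge] = {
    ("ramp.angle", "ball.speed"),
    ("ball.speed", "collision.force"),
    ("ball.mass", "collision.force"),
}

# Assumed edges extracted from a model's chain-of-thought (extraction not shown).
predicted_graph: set[Edge] = {
    ("ramp.angle", "ball.speed"),
    ("ball.speed", "collision.force"),
    ("ball.color", "ball.speed"),  # spurious dependency
}

scores = edge_alignment(predicted_graph, reference_graph)
print(scores)
```

In this invented example the model recovers two of the three annotated edges and adds one spurious edge, so precision and recall both come out at 2/3. A score of this kind can separate a correct answer reached through wrong dependencies from one reached through the right ones, which answer-only accuracy cannot do.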

Problem

Research questions and friction points this paper is trying to address.

causal reasoning

physical understanding

vision-language models

causal dependencies

physical world reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal reasoning

vision-language models

causal graph