ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether existing multimodal large language models (MLLMs) possess genuine cross-modal physical reasoning capabilities or merely rely on linguistic priors that create a hallucinated impression of such capabilities. To this end, the authors introduce ChronoPhyBench, a benchmark that integrates temporal physical dynamics reasoning with visual question answering through two core tasks: next-state prediction and multi-frame temporal ordering. Because answers are given by selecting an image or judging temporal order rather than by generating free text, the benchmark mitigates the language biases inherent in prior evaluations. The work spans video–text alignment, temporal reasoning, and the construction of a large-scale annotated dataset. Experimental results show that current open-source MLLMs perform significantly worse on physical reasoning than previous evaluations suggest: the capability remains at an early developmental stage, and the models exhibit substantial hallucination issues.

📝 Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in open-world reasoning and understanding. However, a critical ambiguity persists: it remains unclear whether these models genuinely synthesize cross-modal information to construct physically grounded reasoning chains, or whether they merely exploit strong language priors to mask single-modality reliance, thereby hallucinating advanced multimodal capabilities. Motivated by this, and to rigorously mitigate language modality bias and shortcuts, we propose a novel multimodal Chronological Physical Dynamics Reasoning Benchmark, ChronoPhyBench, which unifies next-state prediction with Visual Question Answering (VQA) paradigms. By conditioning on historical video context and textual captions, it forces models to deduce subsequent physical states through both single-image selection and the inherently more complex task of multi-frame chronological sorting. Concurrently, we construct a large-scale multimodal reasoning dataset curated using the ChronoPhyBench criteria, comprising over 10,000 long-form videos paired with meticulously annotated captions, totaling 5M tokens. Our experimental evaluations reveal a stark contrast to conclusions drawn by previous benchmarks: the capacity of current open-source models to perform physically grounded multimodal reasoning remains in its infancy. Ultimately, this work seeks to systematically stress-test the reasoning capabilities of multimodal models, quantify hallucination rates, and advance the development of Physical AI, thereby providing the community with a robust and transparent evaluation framework toward Artificial General Intelligence (AGI).
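The two task formats named in the abstract, single-image selection and multi-frame chronological sorting, could be scored along the following lines. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the abstract does not specify metrics, so the function names, the pairwise-agreement score, and the chance-level helper are all hypothetical and used only for illustration.

```python
from itertools import combinations
from math import factorial


def selection_accuracy(predictions, answers):
    """Fraction of items where the chosen candidate index equals the answer key.

    Assumption: each next-state item has exactly one correct candidate image.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def exact_order_match(predicted, reference):
    """True only if every frame is placed in its reference position."""
    return list(predicted) == list(reference)


def pairwise_order_agreement(predicted, reference):
    """Fraction of frame pairs whose relative order matches the reference.

    A softer score than exact match (hypothetical choice): one swapped pair
    costs a single pair out of n * (n - 1) / 2 instead of the whole item.
    """
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted must be a permutation of reference")
    if len(reference) < 2:
        return 1.0
    position = {frame: i for i, frame in enumerate(predicted)}
    # combinations() yields (a, b) with a before b in the reference order.
    pairs = list(combinations(reference, 2))
    agree = sum(position[a] < position[b] for a, b in pairs)
    return agree / len(pairs)


def chance_levels(num_candidates, num_frames):
    """Random-guess baselines: 1/k for selection, 1/n! for exact ordering."""
    return 1 / num_candidates, 1 / factorial(num_frames)
```

The chance-level helper makes the abstract's "inherently more complex" remark concrete: with four candidates a random guess selects the right image a quarter of the time, whereas a random permutation of four frames matches the reference order only once in 24 attempts.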

Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models

Physical Reasoning

Language Priors

Hallucination

Multimodal Understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

ChronoPhyBench

multimodal reasoning

physical dynamics