ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

📅 2026-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether existing multimodal large language models (MLLMs) possess genuine cross-modal physical reasoning capabilities or merely rely on language priors, giving a hallucinated impression of multimodal ability. To this end, the authors introduce ChronoPhyBench, a benchmark that combines temporal physical-dynamics reasoning with visual question answering through two core tasks: next-state prediction by single-image selection and multi-frame chronological ordering. By requiring image selection and temporal-order judgments, the benchmark aims to mitigate the language biases of prior evaluations. The work also contributes a large-scale dataset of over 10,000 long-form videos paired with annotated captions. Experimental results indicate that current open-source MLLMs perform well below what previous benchmarks suggested on physical reasoning, remain at an early developmental stage, and exhibit substantial hallucination.
📝 Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in open-world reasoning and understanding. However, a critical ambiguity persists: it remains unclear whether these models genuinely synthesize cross-modal information to construct physically grounded reasoning chains, or whether they merely exploit strong language priors to mask single-modality reliance, thereby hallucinating advanced multimodal capabilities. Motivated by this, and to rigorously mitigate language-modality bias and shortcuts, we propose ChronoPhyBench, a novel multimodal Chronological Physical Dynamics Reasoning Benchmark. It unifies next-state prediction with the Visual Question Answering (VQA) paradigm: conditioned on historical video context and textual captions, models must deduce subsequent physical states through both single-image selection and the inherently more complex task of chronologically sorting multiple frames. Concurrently, we construct a large-scale multimodal reasoning dataset curated using the ChronoPhyBench criteria, comprising over 10,000 long-form videos paired with meticulously annotated captions, totaling 5M tokens. Our experimental evaluations reveal a stark contrast to conclusions drawn by previous benchmarks: the capacity of current open-source models to perform physically grounded multimodal reasoning remains in its infancy. Ultimately, this work seeks to systematically stress-test the reasoning capabilities of multimodal models, quantify hallucination rates, and advance the development of Physical AI, thereby providing the community with a robust and transparent evaluation framework toward Artificial General Intelligence (AGI).
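The two task formats described in the abstract (single-image selection and multi-frame chronological sorting) could be scored along the following lines. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the abstract does not specify its metrics, so the function names (`selection_accuracy`, `ordering_exact_match`, `pairwise_order_agreement`) and the choice of exact-match and pairwise-agreement scoring are hypothetical illustrations.

```python
from itertools import combinations
from typing import Sequence


def selection_accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the chosen candidate image index equals the gold index.

    Assumed metric: plain accuracy over single-image selection questions.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def ordering_exact_match(predicted: Sequence[int], gold: Sequence[int]) -> bool:
    """True only if the predicted frame order is identical to the gold order."""
    return list(predicted) == list(gold)


def pairwise_order_agreement(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of frame pairs whose relative order matches the gold order.

    Assumed metric: a softer alternative to exact match that gives partial
    credit for mostly correct orderings. Frame IDs must be unique.
    """
    if sorted(predicted) != sorted(gold):
        raise ValueError("predicted must be a permutation of gold")
    position = {frame: i for i, frame in enumerate(predicted)}
    # combinations() preserves input order, so each pair (a, b) has a before b in gold.
    pairs = list(combinations(gold, 2))
    if not pairs:
        return 1.0
    agree = sum(position[a] < position[b] for a, b in pairs)
    return agree / len(pairs)


if __name__ == "__main__":
    # Hypothetical model outputs for four selection questions.
    print(selection_accuracy([0, 2, 1, 3], [0, 2, 2, 3]))
    # Hypothetical ordering answer with the first two frames swapped.
    print(ordering_exact_match([1, 0, 2, 3], [0, 1, 2, 3]))
    print(pairwise_order_agreement([1, 0, 2, 3], [0, 1, 2, 3]))
```

Exact match is strict and treats a single swapped pair the same as a fully reversed sequence, whereas pairwise agreement separates those cases, which is why both are shown here.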
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Physical Reasoning
Language Priors
Hallucination
Multimodal Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ChronoPhyBench
multimodal reasoning
physical dynamics
language priors
temporal ordering
Bin Zhu
Peking University
Yanhao Jia
Nanyang Technological University
Artificial Intelligence, Deep Learning, Computational Neuroscience
Kexin Zhao
UNC Charlotte
Information systems
Jie Wang
Peking University, Shenzhen Graduate School; Tsinghua University
Munan Ning
Peking University
Hao Li
Peking University, Shenzhen Graduate School; Peng Cheng Laboratory
Yuwei Niu
Chongqing University
Visual Representations, Language Priors
Tanqing Sun
Peking University, Shenzhen Graduate School
Huangchong Yan
Peking University, Shenzhen Graduate School
Mingjun Pan
Peking University, Shenzhen Graduate School
Xinyi Wu
Nanyang Technological University
Qishen Yin
Peking University, Shenzhen Graduate School
Yunyang Ge
Peking University
Shuai Zhao
Postdoctoral, Nanyang Technological University
LLMs, Model Security, Backdoor Attack
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance, Wastewater treatment, Environmental bioremediation, Anaerobic digestion, Fate of organic pollutants