Benchmarking Visual State Tracking in Multimodal Video Understanding

📅 2026-06-02

🤖 AI Summary
Current multimodal large language models (MLLMs) are rarely evaluated on visual state tracking, the ability to maintain and integrate visual information across video frames, and they struggle on tasks that require cross-frame reasoning. This work introduces VSTAT, a benchmark of 834 clips drawn from synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment. State-of-the-art models perform far below humans and only modestly above answer-prior baselines. Comparing the models' thinking traces with the underlying video stream indicates that they reason and track correctly in text but fail to visually perceive the events they need to track. A preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures.
📝 Abstract
Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. Although state-of-the-art MLLMs perform strongly on existing video benchmarks, we find that on VSTAT they perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.
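The abstract compares model accuracy against answer-prior baselines without defining them. As a rough illustration only, the sketch below assumes one common form of such a baseline: always predict the most frequent answer for each question type, without looking at the video. The field names `type` and `answer`, the function names, and the toy data are all hypothetical and are not taken from VSTAT.

```python
from collections import Counter

def answer_prior_accuracy(questions):
    """Accuracy of always predicting the most common answer per question type.

    Assumption: the prior is estimated on the same questions it is scored on.
    `questions` is a list of dicts with hypothetical keys "type" and "answer".
    """
    # Count how often each answer occurs within each question type.
    counts = {}
    for q in questions:
        counts.setdefault(q["type"], Counter())[q["answer"]] += 1
    # The baseline's prediction for a type is that type's most frequent answer.
    prior = {t: c.most_common(1)[0][0] for t, c in counts.items()}
    correct = sum(prior[q["type"]] == q["answer"] for q in questions)
    return correct / len(questions)

def model_accuracy(questions, predictions):
    """Fraction of predictions that exactly match the gold answers."""
    correct = sum(p == q["answer"] for q, p in zip(questions, predictions))
    return correct / len(questions)

# Toy data, invented for illustration (not from the benchmark).
toy_questions = [
    {"type": "count", "answer": "3"},
    {"type": "count", "answer": "3"},
    {"type": "count", "answer": "2"},
    {"type": "order", "answer": "A"},
    {"type": "order", "answer": "B"},
    {"type": "order", "answer": "A"},
]
toy_predictions = ["3", "3", "2", "A", "A", "A"]

baseline = answer_prior_accuracy(toy_questions)
model = model_accuracy(toy_questions, toy_predictions)
print(f"answer-prior: {baseline:.3f}, model: {model:.3f}, gap: {model - baseline:.3f}")
```

A small gap between the two numbers means a model gains little from actually watching the video, which is the kind of comparison the abstract describes.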
Problem

Research questions and friction points this paper is trying to address.

visual state tracking
multimodal large language models
video understanding
temporal reasoning
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual state tracking
multimodal large language models
video understanding benchmark
temporal reasoning
perception-reasoning gap