🤖 AI Summary
Existing benchmarks struggle to evaluate multimodal large language models' ability to perceive and reason about rapid state changes, concurrent multi-agent behaviors, and action attribution from a first-person perspective in 3D environments. To address this gap, this work proposes GameplayQA, an agent-centric video understanding benchmark built on a densely annotated dataset of multi-agent 3D gameplay videos with 1.22 labels per second. Temporal event descriptions are structured around a Self-Other-World triadic framework to ensure temporal alignment and semantic clarity. A cognitively layered set of 2.4K diagnostic questions, paired with a distractor taxonomy, enables fine-grained analysis of model hallucinations. Experiments show that state-of-the-art models significantly underperform humans on temporal and cross-video localization, agent attribution, and reasoning in high-decision-density scenarios, exposing critical limitations in current architectures.
📝 Abstract
Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agent-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we curate 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and reasoning in high-decision-density scenarios. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
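To make the annotation scheme concrete, the time-synced Self/Other/World captions and the labels-per-second density metric might be modeled as follows. This is a hypothetical sketch for illustration only: the class and field names are our own assumptions, not the benchmark's released data format.

```python
from dataclasses import dataclass, field
from enum import Enum

class Perspective(Enum):
    """Triadic annotation axes described in the abstract."""
    SELF = "self"    # the first-person player
    OTHER = "other"  # another agent in the scene
    WORLD = "world"  # environment state or event

@dataclass
class Caption:
    start_s: float           # event start time in seconds
    end_s: float             # event end time in seconds
    perspective: Perspective
    text: str                # natural-language description

@dataclass
class AnnotatedClip:
    # Hypothetical schema; field names are illustrative.
    video_id: str
    duration_s: float
    captions: list[Caption] = field(default_factory=list)

    def label_density(self) -> float:
        """Labels per second, the density metric quoted in the abstract."""
        return len(self.captions) / self.duration_s

# Toy example: three concurrent, possibly overlapping captions.
clip = AnnotatedClip("demo_clip", duration_s=10.0)
clip.captions = [
    Caption(0.5, 1.2, Perspective.SELF, "Player fires at the tower"),
    Caption(0.8, 1.5, Perspective.OTHER, "Teammate retreats behind cover"),
    Caption(1.0, 2.0, Perspective.WORLD, "Tower shield collapses"),
]
print(f"{clip.label_density():.2f} labels/s")
```

Allowing captions with overlapping time spans is the key design point: concurrent multi-agent behavior cannot be captured by a single linear caption track.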