EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of benchmarks for evaluating Theory of Mind (ToM) reasoning in first-person (egocentric) vision. The paper introduces EgoToM, the first egocentric ToM benchmark, constructed from Ego4D by using a causal ToM model to generate multiple-choice video question-answering samples covering three interconnected tasks: inferring the camera wearer's goals, in-the-moment beliefs, and future actions. A systematic evaluation of state-of-the-art multimodal large language models (MLLMs) shows that while they achieve near-human accuracy on goal inference, they fall well short of human performance on belief-state reasoning and future action prediction, regardless of scale (including models with over 100B parameters). The benchmark thus provides a reproducible, fine-grained framework for assessing mental-state modeling from egocentric video.

📝 Abstract
We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multiple-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested, with over 100B parameters) fall short of human performance when inferring the camera wearer's in-the-moment belief states and the future actions that are most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants equipped with a reasonable model of the user's internal mental states.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Theory of Mind in egocentric video understanding
Assessing goal, belief, and action prediction from videos
Comparing human and AI performance on mental state inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal ToM model for video QA
Benchmarking goals, beliefs, actions
Multimodal LLMs vs human performance
Authors
Yuxuan Li, Reality Labs
Vijay Veerabadran, Reality Labs
Michael L. Iuzzolino, Reality Labs
Brett D. Roads, Reality Labs
Asli Celikyilmaz, Researcher @ FAIR at Meta (Deep Learning, Natural Language Processing)
Karl Ridgeway, Facebook (Factorial Representations, Few-shot Learning, Deep Embeddings)