EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of benchmarks for evaluating Theory of Mind (ToM) reasoning in first-person (egocentric) vision. The paper introduces EgoToM, the first egocentric ToM benchmark, constructed from Ego4D by using a causal ToM model to generate multiple-choice video question-answering samples covering three interconnected tasks: inferring the camera wearer's goals, in-the-moment beliefs, and future actions. A systematic evaluation of state-of-the-art multimodal large language models (MLLMs) shows that while they achieve near-human accuracy on goal inference, they fall well short of human performance on belief-state reasoning and future action prediction, regardless of scale (including models with over 100B parameters). The benchmark thus provides a reproducible, fine-grained framework for assessing mental-state modeling from egocentric video.

📝 Abstract
We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind (ToM) evaluation to egocentric domains. Using a causal ToM model, we generate multiple-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems. Our evaluation shows that MLLMs achieve close to human-level accuracy on inferring goals from egocentric videos. However, MLLMs (including the largest ones we tested, with over 100B parameters) fall short of human performance when inferring the camera wearer's in-the-moment belief states and the future actions that are most consistent with the unseen video future. We believe that our results will shape the future design of an important class of egocentric digital assistants equipped with a reasonable model of the user's internal mental states.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Theory of Mind in egocentric video understanding
Assessing goal, belief, and action prediction from videos
Comparing human and AI performance on mental state inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal ToM model for video QA
Benchmarking goals, beliefs, actions
Multimodal LLMs vs human performance
Authors
Yuxuan Li, Reality Labs
Vijay Veerabadran, Reality Labs
Michael L. Iuzzolino, Reality Labs
Brett D. Roads, Reality Labs
Asli Celikyilmaz, Researcher @ FAIR at Meta (Deep Learning, Natural Language Processing)
Karl Ridgeway, Facebook (Factorial Representations, Few-shot Learning, Deep Embeddings)