🤖 AI Summary
This work addresses key challenges in first-person video understanding—partial observability, limited field-of-view, and self-motion interference—arising from dynamic egocentric perspectives. We propose a causal-aware two-stage planning-verification framework that integrates top-down intention planning with cross-perspective (first-person → third-person) consistency verification. Our method employs a multimodal reinforcement learning architecture grounded in Group Relative Policy Optimization (GRPO), enabling visually coherent and causally interpretable action reasoning. Crucially, we introduce the first planning framework that explicitly embeds cross-perspective verification into the decision-making process, significantly improving robustness to egocentric scene dynamics. Evaluated on the EgoBlind and EgoOrient benchmarks, our approach achieves absolute accuracy improvements of 7.7% and 4.4%, respectively, while preserving strong generalization to third-person tasks.
📝 Abstract
Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer's viewpoint, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, a limited field of view, and self-referenced motion. We introduce $\textbf{EgoVITA}$, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an $\textbf{egocentric planning phase}$, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an $\textbf{exocentric verification phase}$, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks.
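The core of GRPO, on which the framework above is built, is that a group of sampled responses to the same prompt is scored and each response's advantage is computed relative to the group's own reward statistics, with no learned value baseline. The following is a minimal illustrative sketch of that advantage computation; the function name, reward values, and group size are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of GRPO group-relative advantages (illustrative only;
# grpo_advantages and the example rewards are hypothetical, not EgoVITA's code).
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's statistics.

    GRPO replaces a learned value baseline with the mean reward of a
    group of G responses sampled for the same prompt; the normalized
    advantage then weights the policy-gradient update per response.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for G = 4 sampled plans, e.g. scored by a
# consistency-verification reward (values are made up).
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Plans scoring above the group mean receive positive advantages and are reinforced; below-mean plans are suppressed, so in this framework plans that survive exocentric verification are preferentially reinforced.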