🤖 AI Summary
This work addresses key challenges in first-person video understanding—partial observability, limited field-of-view, and self-motion interference—arising from dynamic egocentric perspectives. We propose a causal-aware two-stage planning-verification framework that integrates top-down intention planning with cross-perspective (first-person → third-person) consistency verification. Our method employs a multimodal reinforcement learning architecture grounded in Group Relative Policy Optimization (GRPO), enabling visually coherent and causally interpretable action reasoning. Crucially, we introduce the first planning framework that explicitly embeds cross-perspective verification into the decision-making process, significantly improving robustness to egocentric scene dynamics. Evaluated on the EgoBlind and EgoOrient benchmarks, our approach achieves absolute accuracy improvements of 7.7% and 4.4%, respectively, while preserving strong generalization to third-person tasks.
📝 Abstract
Reasoning about intentions and actions from a first-person (egocentric) perspective remains a fundamental challenge for multimodal large language models (MLLMs). Unlike third-person (exocentric) videos that capture scenes from an outside observer's viewpoint, egocentric videos reflect the actor's continuously changing viewpoint, introducing partial observability, a limited field of view, and self-referenced motion. We introduce $\textbf{EgoVITA}$, a reinforcement learning framework that enables MLLMs to reason through structured planning and verification. Built on Group Relative Policy Optimization (GRPO), EgoVITA alternates between two stages: (1) an $\textbf{egocentric planning phase}$, where the model reasons from a first-person viewpoint to predict a step-by-step plan of future actions, and (2) an $\textbf{exocentric verification phase}$, where it switches to a third-person perspective to check the visual and logical consistency of that plan. Through GRPO, the model learns to make plans that are causally predictive of upcoming visual observations, leading to more coherent and visually grounded reasoning. EgoVITA achieves significant gains on egocentric reasoning tasks, outperforming the baseline Qwen2.5-VL-7B by $\mathbf{+7.7}$ on EgoBlind and $\mathbf{+4.4}$ on EgoOrient, while maintaining strong generalization on exocentric video tasks.
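The core of GRPO, on which the framework above is built, is that a group of sampled responses to the same prompt is scored and each response's advantage is computed relative to the group's own reward statistics, with no learned value baseline. The following is a minimal illustrative sketch of that advantage computation; the function name, reward values, and group size are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of GRPO group-relative advantages (illustrative only;
# grpo_advantages and the example rewards are hypothetical, not EgoVITA's code).
from statistics import mean, stdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's statistics.

    GRPO replaces a learned value baseline with the mean reward of a
    group of G responses sampled for the same prompt; the normalized
    advantage then weights the policy-gradient update per response.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for G = 4 sampled plans, e.g. scored by a
# consistency-verification reward (values are made up).
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
```

Plans scoring above the group mean receive positive advantages and are reinforced; below-mean plans are suppressed, so in this framework plans that survive exocentric verification are preferentially reinforced.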