🤖 AI Summary
Large Vision-Language Models (LVLMs) exhibit limited capability in fine-grained modeling of human-object interactions and bodily motion in third-person daily activity videos, hindering their deployment in eldercare assistance and cognitive assessment. To address this, we propose an online ego2exo knowledge distillation framework, introducing EgoMimic, a novel skeleton-guided method that synthesizes learnable egocentric representations from exocentric videos without requiring paired ground-truth data. By integrating egocentric perceptual cues with LVLMs, our approach significantly enhances semantic understanding of interactions. We evaluate on six established Activities of Daily Living (ADL) benchmarks and a newly curated multiple-choice question benchmark, EgoPerceptionMCQ, achieving substantial performance gains. All code, models, and data will be publicly released.
📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. This limitation is particularly evident in ADL tasks, where understanding detailed human-object interaction and human-centric motion is crucial for applications such as elderly monitoring and cognitive assessment. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLMs' understanding of exocentric ADL videos. To this end, we propose an online ego2exo distillation approach to learn ego-augmented exo representations in LVLMs. While effective, this approach requires paired ego-exo training data, which is impractical to collect for real-world ADL scenarios. Consequently, we develop EgoMimic, a skeleton-guided method that can generate mimicked ego views from exocentric videos. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, as demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed EgoPerceptionMCQ benchmark, designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at https://github.com/dominickrei/EgoExo4ADL.
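For intuition, below is a minimal sketch of what one online ego2exo distillation step could look like. Everything here is an illustrative assumption rather than the paper's actual implementation: the module names (`ego_mimic`, `exo_encoder`, `ego_encoder`), the cosine distillation loss, and the shape of the training step are placeholders chosen only to convey the idea.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of one online ego2exo distillation step.
# The three modules are passed in as callables; their architectures,
# the loss, and the overall loop are assumptions, not the paper's method.

def distill_step(exo_encoder, ego_encoder, ego_mimic,
                 exo_clip, skeletons, optimizer):
    """exo_encoder : trainable visual branch on the exocentric clip (student)
    ego_encoder : frozen encoder for the mimicked egocentric view (teacher)
    ego_mimic   : skeleton-guided generator of mimicked ego views"""
    # 1. Synthesize a mimicked ego view from the exo clip alone,
    #    so no paired ego-exo ground truth is required.
    with torch.no_grad():
        mimic_ego_clip = ego_mimic(exo_clip, skeletons)
        ego_feats = ego_encoder(mimic_ego_clip)  # detached teacher target

    # 2. Encode the exo clip and pull its representation toward
    #    the ego-perspective cues (assumed cosine distillation loss).
    exo_feats = exo_encoder(exo_clip)
    loss = 1.0 - F.cosine_similarity(exo_feats, ego_feats, dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point the sketch tries to capture is that the teacher signal is produced online from the exocentric clip itself via the skeleton-guided generator, which is why the approach sidesteps the need for paired ego-exo recordings.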