🤖 AI Summary
This work addresses a limitation of existing video generation models, including WAN 2.2: they are constrained to monocular, single-view synthesis and cannot perform cross-view generation. The authors propose a fine-tuning-free framework for exocentric-to-egocentric (Exo2Ego) video synthesis built on three core components: (1) EgoExo-Align, which enforces latent-space alignment between exocentric and egocentric representations; (2) MultiExoCon, which aggregates multi-view exocentric conditioning signals; and (3) PoseInj, which injects pose-aware geometric priors into the latent diffusion process. Leveraging relative camera pose priors alongside multimodal conditions (text, image, and auxiliary exocentric video), the approach keeps the autoencoder-diffusion backbone frozen while ensuring geometric consistency and high-fidelity output. Evaluated on ExoEgo4D, the method significantly outperforms all baselines, demonstrating strong generalization and robust cross-view reasoning in complex, real-world scenarios.
📝 Abstract
Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks exocentric-to-egocentric (Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment (EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN 2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego-view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Exo2Ego synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
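To make the three-module pipeline concrete, here is a minimal sketch of how the conditioning path could be wired together. This is an illustrative assumption, not the released implementation: the function names (`aggregate_exo_views`, `align_to_ego`, `inject_pose`), the mean-pooling fusion, the linear alignment map, and the 6-DoF pose embedding are all hypothetical stand-ins for MultiExoCon, EgoExo-Align, and PoseInj respectively.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate_exo_views(exo_latents):
    # MultiExoCon sketch: fuse V exocentric view latents into one
    # conditioning signal (simple mean pooling; the paper's actual
    # fusion mechanism is not specified in the abstract).
    return np.stack(exo_latents).mean(axis=0)

def align_to_ego(exo_cond, W_align):
    # EgoExo-Align sketch: a learned linear map reorienting exocentric
    # features toward the egocentric latent space (W_align is a
    # hypothetical learned matrix).
    return exo_cond @ W_align

def inject_pose(latent, rel_pose, W_pose):
    # PoseInj sketch: embed the relative exo-to-ego camera pose and add
    # it to the latent state as an additive geometric prior.
    return latent + rel_pose @ W_pose

d = 16                                                   # latent channel dim
exo_latents = [rng.normal(size=(4, d)) for _ in range(3)]  # 3 exo views, 4 tokens each
W_align = rng.normal(size=(d, d)) * 0.1                  # hypothetical alignment weights
W_pose = rng.normal(size=(6, d)) * 0.1                   # hypothetical 6-DoF pose embedding
rel_pose = rng.normal(size=(6,))                         # relative exo-to-ego pose

cond = inject_pose(align_to_ego(aggregate_exo_views(exo_latents), W_align),
                   rel_pose, W_pose)
print(cond.shape)  # (4, 16): per-token conditioning fed to the frozen diffusion model
```

In a real system the frozen WAN 2.2 backbone would consume `cond` (e.g. via cross-attention or latent concatenation) alongside its standard text/image conditions; the key point the sketch captures is that all three modules operate on latents, leaving the backbone weights untouched.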