EgoLCD: Egocentric Video Generation with Long Context Diffusion

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating long-duration, temporally coherent first-person videos faces two key challenges: modeling intricate hand-object interactions and mitigating content drift and generative forgetting in autoregressive synthesis. To address these, we propose a dual-memory mechanism that integrates a long-term sparse KV cache with short-term attention-based memory, reinforced by a memory-regulation loss and structured narrative prompting to strengthen long-range contextual modeling. Our end-to-end framework employs a diffusion-based architecture incorporating LoRA fine-tuning, sparse attention, and structured prompt engineering. Evaluated on the EgoVid-5M benchmark, our method achieves state-of-the-art performance, significantly improving perceptual quality and temporal consistency while effectively alleviating identity drift and semantic forgetting during generation.
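The summary describes the dual-memory mechanism only at a high level. Below is a minimal PyTorch-style sketch, under stated assumptions, of how a dense short-term attention window can spill evicted entries into a sparse long-term KV cache; the class name, the FIFO eviction rule, and the fixed budgets are illustrative choices, not the paper's actual design.

```python
# Minimal sketch (not the authors' code): a dual-memory attention step that
# combines a short-term window of recent keys/values with a sparsely
# retained long-term KV cache. All names, budgets, and the FIFO retention
# rule are illustrative assumptions.
import torch
import torch.nn.functional as F

class DualMemoryCache:
    def __init__(self, short_window: int = 16, long_budget: int = 64):
        self.short_window = short_window   # dense recent-frame memory
        self.long_budget = long_budget     # sparse global-context slots
        self.short_k, self.short_v = [], []
        self.long_k, self.long_v = [], []

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Append the newest frame's keys/values; when the short-term
        window overflows, move the oldest entry into long-term memory.
        The paper may use a learned or attention-score-based retention
        rule instead of this FIFO placeholder."""
        self.short_k.append(k)
        self.short_v.append(v)
        if len(self.short_k) > self.short_window:
            self.long_k.append(self.short_k.pop(0))
            self.long_v.append(self.short_v.pop(0))
            if len(self.long_k) > self.long_budget:
                self.long_k.pop(0)
                self.long_v.pop(0)

    def attend(self, q: torch.Tensor) -> torch.Tensor:
        """Attend over the concatenated long- and short-term memory."""
        k = torch.cat(self.long_k + self.short_k, dim=1)  # (B, T, D)
        v = torch.cat(self.long_v + self.short_v, dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # (B, T_q, D)
```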

📝 Abstract
Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
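The abstract does not give the form of the Memory Regulation Loss. One plausible reading, sketched below in PyTorch, is a penalty that keeps the model's attention over long-term memory slots from collapsing or shifting abruptly between consecutive generation steps; the function name and the choice of a symmetric KL divergence are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a memory-regulation-style loss (the published
# abstract does not specify its exact form): penalize abrupt changes in
# how the model attends over its long-term memory slots across
# consecutive steps, discouraging the cache from being ignored or
# overwritten wholesale.
import torch

def memory_regulation_loss(attn_prev: torch.Tensor,
                           attn_curr: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """attn_*: (batch, heads, memory_slots) attention mass over the
    long-term KV cache at two consecutive generation steps."""
    p = attn_prev.clamp_min(eps)
    q = attn_curr.clamp_min(eps)
    # Symmetric KL keeps the memory-usage profile stable without
    # freezing it entirely.
    kl_pq = (p * (p / q).log()).sum(-1)
    kl_qp = (q * (q / p).log()).sum(-1)
    return 0.5 * (kl_pq + kl_qp).mean()
```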
Problem

Research questions and friction points this paper is trying to address.

Generating long, coherent egocentric videos requires stable long-term memory
Content drift and generative forgetting degrade autoregressive video synthesis
Temporal consistency is hard to maintain in hand-object interaction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-Term Sparse KV Cache for stable global context
Memory Regulation Loss enforces consistent memory usage
Structured Narrative Prompting provides explicit temporal guidance (see the sketch after this list)
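As an illustration of structured narrative prompting (the paper's actual prompt schema is not reproduced here), the hypothetical helper below conditions each clip on its position in the overall procedure, the step just completed, and the step to come, which is one way to give the generator explicit temporal guidance.

```python
# Illustrative sketch only: the exact prompt schema is an assumption.
# Each clip's text condition encodes the surrounding narrative so the
# generator knows where it sits in the procedure.
def build_narrative_prompt(task: str, steps: list[str], idx: int) -> str:
    prev_step = steps[idx - 1] if idx > 0 else "none (first step)"
    next_step = steps[idx + 1] if idx + 1 < len(steps) else "none (final step)"
    return (
        f"Task: {task}\n"
        f"Completed so far: {prev_step}\n"
        f"Current step ({idx + 1}/{len(steps)}): {steps[idx]}\n"
        f"Upcoming step: {next_step}\n"
        "Keep object identities and scene layout consistent with prior clips."
    )

# Example usage:
steps = ["pick up the mug", "pour coffee", "add milk", "stir"]
print(build_narrative_prompt("make a coffee", steps, idx=1))
```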
Authors
Liuzhou Zhang, Peking University
Jiarui Ye, Peking University
Yuanlei Wang, Sun Yat-sen University
Ming Zhong, Zhejiang University
Mingju Gao, unknown affiliation (Computer Vision, Robotics)
Wanke Xia, Tsinghua University
Bowen Zeng, Zhejiang University
Zeyu Zhang, Peking University
Hao Tang, Peking University