EgoLCD: Egocentric Video Generation with Long Context Diffusion

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating long-duration, temporally coherent first-person videos faces two key challenges: modeling intricate hand-object interactions and mitigating content drift and generative forgetting in autoregressive synthesis. To address these, we propose a dual-memory mechanism that integrates a long-term sparse KV cache with short-term attention-based memory, reinforced by a memory-regulation loss and structured narrative prompting to strengthen long-range contextual modeling. Our end-to-end framework employs a diffusion-based architecture incorporating LoRA fine-tuning, sparse attention, and structured prompt engineering. Evaluated on the EgoVid-5M benchmark, our method achieves state-of-the-art performance, significantly improving perceptual quality and temporal consistency while effectively alleviating identity drift and semantic forgetting during generation.
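The summary describes the dual-memory mechanism only at a high level. Below is a minimal PyTorch-style sketch, under stated assumptions, of how a dense short-term attention window can spill evicted entries into a sparse long-term KV cache; the class name, the FIFO eviction rule, and the fixed budgets are illustrative choices, not the paper's actual design.

```python
# Minimal sketch (not the authors' code): a dual-memory attention step that
# combines a short-term window of recent keys/values with a sparsely
# retained long-term KV cache. All names, budgets, and the FIFO retention
# rule are illustrative assumptions.
import torch
import torch.nn.functional as F

class DualMemoryCache:
    def __init__(self, short_window: int = 16, long_budget: int = 64):
        self.short_window = short_window   # dense recent-frame memory
        self.long_budget = long_budget     # sparse global-context slots
        self.short_k, self.short_v = [], []
        self.long_k, self.long_v = [], []

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Append the newest frame's keys/values; when the short-term
        window overflows, move the oldest entry into long-term memory.
        The paper may use a learned or attention-score-based retention
        rule instead of this FIFO placeholder."""
        self.short_k.append(k)
        self.short_v.append(v)
        if len(self.short_k) > self.short_window:
            self.long_k.append(self.short_k.pop(0))
            self.long_v.append(self.short_v.pop(0))
            if len(self.long_k) > self.long_budget:
                self.long_k.pop(0)
                self.long_v.pop(0)

    def attend(self, q: torch.Tensor) -> torch.Tensor:
        """Attend over the concatenated long- and short-term memory."""
        k = torch.cat(self.long_k + self.short_k, dim=1)  # (B, T, D)
        v = torch.cat(self.long_v + self.short_v, dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # (B, T_q, D)
```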

📝 Abstract
Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
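The abstract does not give the form of the Memory Regulation Loss. One plausible reading, sketched below in PyTorch, is a penalty that keeps the model's attention over long-term memory slots from collapsing or shifting abruptly between consecutive generation steps; the function name and the choice of a symmetric KL divergence are assumptions, not the paper's definition.

```python
# Hypothetical sketch of a memory-regulation-style loss (the published
# abstract does not specify its exact form): penalize abrupt changes in
# how the model attends over its long-term memory slots across
# consecutive steps, discouraging the cache from being ignored or
# overwritten wholesale.
import torch

def memory_regulation_loss(attn_prev: torch.Tensor,
                           attn_curr: torch.Tensor,
                           eps: float = 1e-8) -> torch.Tensor:
    """attn_*: (batch, heads, memory_slots) attention mass over the
    long-term KV cache at two consecutive generation steps."""
    p = attn_prev.clamp_min(eps)
    q = attn_curr.clamp_min(eps)
    # Symmetric KL keeps the memory-usage profile stable without
    # freezing it entirely.
    kl_pq = (p * (p / q).log()).sum(-1)
    kl_qp = (q * (q / p).log()).sum(-1)
    return 0.5 * (kl_pq + kl_qp).mean()
```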
Problem

Research questions and friction points this paper is trying to address.

Generating long, coherent egocentric videos requires stable long-term memory
Content drift and generative forgetting degrade autoregressive video synthesis
Temporal consistency is hard to maintain in hand-object interaction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-Term Sparse KV Cache for stable global context
Memory Regulation Loss enforces consistent memory usage
Structured Narrative Prompting provides explicit temporal guidance (see the sketch after this list)
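As an illustration of structured narrative prompting (the paper's actual prompt schema is not reproduced here), the hypothetical helper below conditions each clip on its position in the overall procedure, the step just completed, and the step to come, which is one way to give the generator explicit temporal guidance.

```python
# Illustrative sketch only: the exact prompt schema is an assumption.
# Each clip's text condition encodes the surrounding narrative so the
# generator knows where it sits in the procedure.
def build_narrative_prompt(task: str, steps: list[str], idx: int) -> str:
    prev_step = steps[idx - 1] if idx > 0 else "none (first step)"
    next_step = steps[idx + 1] if idx + 1 < len(steps) else "none (final step)"
    return (
        f"Task: {task}\n"
        f"Completed so far: {prev_step}\n"
        f"Current step ({idx + 1}/{len(steps)}): {steps[idx]}\n"
        f"Upcoming step: {next_step}\n"
        "Keep object identities and scene layout consistent with prior clips."
    )

# Example usage:
steps = ["pick up the mug", "pour coffee", "add milk", "stir"]
print(build_narrative_prompt("make a coffee", steps, idx=1))
```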
Authors
Liuzhou Zhang, Peking University
Jiarui Ye, Peking University
Yuanlei Wang, Sun Yat-sen University
Ming Zhong, Zhejiang University
Mingju Gao, unknown affiliation (Computer Vision, Robotics)
Wanke Xia, Tsinghua University
Bowen Zeng, Zhejiang University
Zeyu Zhang, Peking University
Hao Tang, Peking University