Unified Dense Prediction of Video Diffusion

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of synchronized dense-prediction annotations (e.g., instance segmentation, depth maps) and unified modeling approaches in text-to-video generation. Methodologically, it introduces the first end-to-end unified video generation framework: (1) it constructs the first large-scale multimodal dataset jointly annotated with text descriptions, videos, pixel-level instance segmentation masks, and dense depth maps; (2) it proposes learnable task embeddings and a multi-head dense-prediction decoder that encodes RGB videos, segmentation maps, and depth maps into unified colormap tensors, all embedded within a single diffusion model; and (3) it enables multi-task joint generation without additional computational overhead. Experiments demonstrate state-of-the-art performance across video fidelity, temporal consistency, and motion smoothness, while supporting flexible extension to new dense prediction tasks.
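The paper itself does not specify its colormap scheme, but the idea of encoding dense predictions as RGB-shaped "colormap tensors" can be sketched as follows. This is a minimal illustration under assumed conventions (a jet-like mapping for depth, a random palette for instance IDs); the function names and formulas are illustrative, not the authors' implementation.

```python
import numpy as np

def depth_to_colormap(depth):
    """Normalize a depth map to [0, 1] and encode it as a 3-channel
    jet-like colormap so it matches the shape of an RGB frame."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    # Piecewise-linear jet approximation: blue -> green -> red.
    r = np.clip(1.5 - np.abs(4 * d - 3), 0, 1)
    g = np.clip(1.5 - np.abs(4 * d - 2), 0, 1)
    b = np.clip(1.5 - np.abs(4 * d - 1), 0, 1)
    return np.stack([r, g, b], axis=-1)

def masks_to_colormap(instance_ids, seed=0):
    """Assign each instance ID a fixed random color, yielding a
    3-channel tensor with the same spatial shape as an RGB frame."""
    rng = np.random.default_rng(seed)
    palette = rng.uniform(size=(instance_ids.max() + 1, 3))
    palette[0] = 0.0  # background stays black
    return palette[instance_ids]

depth = np.linspace(0.0, 10.0, 16).reshape(4, 4)
ids = np.array([[0, 1], [2, 2]])
print(depth_to_colormap(depth).shape)  # (4, 4, 3)
print(masks_to_colormap(ids).shape)    # (2, 2, 3)
```

Once masks and depth share the 3-channel layout of RGB frames, one diffusion model can generate all three modalities with the same input/output interface.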

📝 Abstract
We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We use colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense-prediction information improves the generated videos' consistency and motion smoothness without increasing computational cost. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, which surpasses the state of the art in video quality, consistency, and motion smoothness.
Problem

Research questions and friction points this paper is trying to address.

Simultaneously generate videos, segmentation, and depth maps from text prompts.
Improve video consistency and motion smoothness without extra computational cost.
Address lack of datasets with captions, videos, segmentation, and depth maps.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified network for video and dense prediction generation
Colormap integration enhances consistency and motion smoothness
Learnable task embeddings unify multiple dense prediction tasks
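The third innovation can be sketched in a few lines: a small table holds one learned vector per task, and the vector for the requested task is added to the model's conditioning features so a shared diffusion backbone knows which modality to emit. The table size, dimension, and `condition` helper below are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical learnable task-embedding table: one vector per task
# (RGB video, segmentation, depth). In training these vectors would be
# optimized jointly with the diffusion model; here they are random.
TASKS = {"rgb": 0, "segmentation": 1, "depth": 2}
EMBED_DIM = 8

rng = np.random.default_rng(0)
task_table = rng.normal(scale=0.02, size=(len(TASKS), EMBED_DIM))

def condition(features, task):
    """Add the task embedding to every token of the conditioning
    features, steering the shared backbone toward that modality."""
    return features + task_table[TASKS[task]]

tokens = rng.normal(size=(4, EMBED_DIM))  # e.g. text-prompt tokens
cond = condition(tokens, "depth")
print(cond.shape)  # (4, 8)
```

Because only a lookup and an addition are involved, switching or adding tasks costs essentially nothing at inference time, which matches the paper's claim of multi-task generation without extra computational overhead.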