🤖 AI Summary
This work addresses the lack of synchronized dense-prediction annotations (e.g., instance segmentation, depth maps) and unified modeling approaches in text-to-video generation. Methodologically, it introduces the first end-to-end unified video generation framework: (1) it constructs the first large-scale multimodal dataset jointly annotated with text descriptions, videos, pixel-level instance segmentation masks, and dense depth maps; (2) it proposes learnable task embeddings and a multi-head dense-prediction decoder that encodes RGB videos, segmentation maps, and depth maps into unified colormap tensors, all embedded within a single diffusion model; and (3) it enables multi-task joint generation without additional computational overhead. Experiments demonstrate state-of-the-art performance across video fidelity, temporal consistency, and motion smoothness, while supporting flexible extension to new dense prediction tasks.
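The "unified colormap tensor" idea can be sketched concretely: render depth maps and instance masks as 3-channel arrays so a single diffusion model can treat every modality like an RGB frame. The following is a minimal illustration under assumed conventions (the function names, a grayscale depth colormap, and a seeded random palette are hypothetical, not the paper's actual encoding):

```python
import numpy as np

def depth_to_colormap(depth: np.ndarray) -> np.ndarray:
    """Normalize a float depth map to [0, 1] and spread it over 3 channels
    (a simple grayscale-style colormap; a real system might use e.g. viridis)."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return np.stack([d, d, d], axis=-1)  # (H, W, 3)

def seg_to_colormap(instance_ids: np.ndarray, num_colors: int = 256) -> np.ndarray:
    """Assign each instance id a deterministic pseudo-random RGB color."""
    rng = np.random.default_rng(0)                  # fixed seed -> stable palette
    palette = rng.uniform(size=(num_colors, 3))     # (num_colors, 3), values in [0, 1)
    return palette[instance_ids % num_colors]       # (H, W, 3)

# With every modality in the same (H, W, 3) format, RGB frames, depth maps,
# and segmentation masks can be stacked into one batch for joint generation.
depth = np.linspace(0.5, 5.0, 16).reshape(4, 4)
seg = np.arange(16).reshape(4, 4)
unified = np.stack([depth_to_colormap(depth), seg_to_colormap(seg)])  # (2, 4, 4, 3)
```

Because all modalities share one tensor layout, no extra decoder branches or compute are needed per task, which is consistent with the "no additional computational overhead" claim above.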
📝 Abstract
We present a unified network that simultaneously generates videos and their corresponding entity segmentation and depth maps from text prompts. We use colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense-prediction information improves the temporal consistency and motion smoothness of the generated videos without increasing computational cost. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, datasetname, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation masks, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, which surpasses the state of the art in video quality, consistency, and motion smoothness.
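The learnable task embeddings mentioned above can be pictured as a small lookup table whose rows steer one shared backbone toward RGB, segmentation, or depth output. This is an illustrative sketch only (the task names, embedding width, and additive conditioning are assumptions, not the paper's implementation):

```python
import numpy as np

# Hypothetical task vocabulary and embedding width.
TASKS = {"rgb": 0, "segmentation": 1, "depth": 2}
D = 8

rng = np.random.default_rng(42)
task_table = rng.normal(size=(len(TASKS), D))  # learnable parameters in a real model

def condition(features: np.ndarray, task: str) -> np.ndarray:
    """Add the task embedding to every spatial position of the features,
    mimicking how a diffusion backbone could be task-conditioned."""
    return features + task_table[TASKS[task]]  # broadcasts (D,) over (H, W, D)

feats = np.zeros((4, 4, D))
out_depth = condition(feats, "depth")
out_seg = condition(feats, "segmentation")
# The same features are shifted differently per task, so one set of backbone
# weights can serve all dense-prediction outputs.
```

Adding a new dense prediction task then only requires a new row in the table, which matches the flexibility claim in the abstract.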