🤖 AI Summary
Current video generation models struggle with fine-grained controllability—particularly for multi-object coordinated motion, appearance evolution, and cross-object transformations—under complex text prompts. To address this, we propose a blob-based video representation paradigm that disentangles videos into independently controllable motion and appearance components. We introduce a masked 3D attention mechanism to enhance inter-frame spatial consistency and a learnable text embedding interpolation module for frame-level semantic precision. Our method integrates blob-conditioned diffusion modeling with dual-architecture adaptation (U-Net and DiT). Evaluated on multiple benchmarks, it achieves state-of-the-art zero-shot generation quality and layout controllability. When augmented with LLM-driven layout planning, our approach surpasses leading closed-source models in compositional accuracy.
📝 Abstract
Existing video generation models struggle to follow complex text prompts and to synthesize multiple objects, raising the need for additional grounding input to improve controllability. In this work, we propose to decompose videos into visual primitives, yielding the blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motion and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module that interpolates text embeddings, so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid on top of both U-Net-based and DiT-based video diffusion models. Extensive experiments show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.
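To make the two mechanisms concrete, here is a minimal NumPy sketch: masked 3D attention restricts each space-time token to attend only to tokens in the same blob region (enforcing regional consistency across frames), and per-frame interpolation blends two text embeddings to move semantics smoothly between frames. All names, shapes, and the hard same-blob mask are illustrative assumptions, not the paper's actual implementation (which learns the interpolation weights and operates inside a diffusion backbone).

```python
import numpy as np

def masked_3d_attention(q, k, v, blob_ids):
    """Scaled dot-product attention over flattened space-time tokens.
    q, k, v: (N, d) where N = frames * tokens_per_frame; blob_ids: (N,).
    Each token may only attend to tokens with the same blob id, in any frame."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (N, N) attention logits
    same_blob = blob_ids[:, None] == blob_ids[None, :]
    scores = np.where(same_blob, scores, -np.inf)    # mask out cross-blob attention
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # (N, d)

def interpolate_text_embeddings(e_start, e_end, alphas):
    """Per-frame convex blend of two text embeddings, (d,) each.
    alphas: (T,) blend weights; BlobGEN-Vid learns these, here they are given."""
    a = alphas[:, None]
    return (1.0 - a) * e_start[None, :] + a * e_end[None, :]   # (T, d)
```

For example, with four tokens split into two blobs (`blob_ids = [0, 0, 1, 1]`), the output for token 0 mixes only the values of tokens 0 and 1, so a blob's features stay internal to its region while still being shared across frames.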