LTX-Video: Realtime Video Latent Diffusion

📅 2024-12-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key challenges in video generation: the trade-off between high compression ratios and high fidelity, and the low efficiency of cross-modal (text- and image-to-video) generation. We propose an end-to-end, jointly optimized latent diffusion model for video. Our core contributions are: (1) applying patchification before the Video-VAE encoder, decoupling compression from fine-grained detail modeling; (2) enabling full spatiotemporal self-attention in a highly compressed latent space (1:192), with pixel reconstruction and the final denoising step unified in the VAE decoder; and (3) joint training and real-time inference across text-to-video and image-to-video modalities. On an H100 GPU, the model generates a 5-second, 24 fps video at 768×512 resolution in just 2 seconds, faster than real time and outperforming models of comparable scale. Code and pretrained models are publicly available.
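A back-of-envelope check of the numbers above (assuming exactly 120 latent-aligned frames for a 5-second clip, which is my own simplification) shows why full spatiotemporal self-attention stays tractable at this compression level:

```python
# 768x512 video, 5 s at 24 fps, with 32x32x8 pixels folded into each
# latent token per the stated downscaling. Exact frame handling at the
# clip boundary is an assumption, not from the paper.
frames = 5 * 24                                    # 120 frames
tokens = (768 // 32) * (512 // 32) * (frames // 8)
print(tokens)  # 5760
```

At ~5760 tokens, a dense attention matrix has only ~33M entries, small enough for a transformer to attend over the entire clip at once.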

📝 Abstract
We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32×32×8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768×512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
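A minimal numerical sketch of the relocated patchify step: the space-time-to-depth fold below and the 128-channel latent implied by the 1:192 ratio are my own arithmetic from the stated figures, not the paper's code.

```python
import numpy as np

def patchify(video, pt=8, ph=32, pw=32):
    """Fold each 8x32x32 pixel block of a (C,T,H,W) video into channels,
    producing one token per spatiotemporal block (space-time-to-depth)."""
    c, t, h, w = video.shape
    v = video.reshape(c, t // pt, pt, h // ph, ph, w // pw, pw)
    v = v.transpose(0, 2, 4, 6, 1, 3, 5)   # move block dims next to channels
    return v.reshape(c * pt * ph * pw, t // pt, h // ph, w // pw)

video = np.zeros((3, 8, 512, 768), dtype=np.float32)  # C,T,H,W
tokens = patchify(video)
print(tokens.shape)  # (24576, 1, 16, 24)

# 1:192 overall compression over 3*8*32*32 = 24576 input values per block
# implies 24576 / 192 = 128 latent channels per token
print((3 * 8 * 32 * 32) // 192)  # 128
```

The VAE encoder then only needs to compress channels, since all spatial and temporal reduction has already happened at its input.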
Problem

Research questions and friction points this paper is trying to address.

Video Generation
Real-time Processing
Compression vs Detail Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LTX-Video
Transformer-based Video Generation
High-efficiency Compression
Yoav HaCohen
PhD, Hebrew University, Lightricks
Multimodal Generative AI, Computational Photography, Computer Vision
Nisan Chiprut
Lightricks
GenAI
Benny Brazowski
Lightricks
Daniel Shalem
Lightricks
David-Pur Moshe
Lightricks
Eitan Richardson
Researcher, Lightricks Ltd
Deep Learning, Computer Vision, Generative AI
E. Levin
Lightricks
Guy Shiran
Lightricks
Nir Zabari
Researcher
Deep Learning, Computer Vision, Image Processing
Ori Gordon
Lightricks
Poriya Panet
Lightricks
Sapir Weissbuch
Lightricks
V. Kulikov
Lightricks
Yaki Bitterman
Lightricks
Zeev Melumian
Lightricks
Ofir Bibi
Lightricks, Hebrew University of Jerusalem
Machine Learning, Deep Learning, Artificial Intelligence, Statistical Signal Processing