Four-Plane Factorized Video Autoencoders

📅 2024-12-05
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency in modeling high-dimensional latent spaces and the substantial computational overhead during training and inference in video generation, this paper proposes the Four-Plane Variational Autoencoder (4P-VAE). The method introduces a novel four-plane factorized latent space architecture, projecting spatiotemporal video volumes onto four orthogonal 2D planes. This design enables sublinear growth of latent dimensionality with respect to input resolution while preserving representation fidelity under high compression ratios. 4P-VAE natively supports diverse downstream tasks—including class-conditional generation, frame prediction, and video interpolation—and integrates seamlessly with latent diffusion models (LDMs) for joint training. Experiments demonstrate that 4P-VAE achieves high-fidelity video reconstruction while significantly accelerating LDM training and inference and reducing GPU memory consumption. Overall, it establishes a new paradigm for efficient latent-space modeling of high-dimensional temporal data.
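The core idea, projecting a spatiotemporal volume onto 2D planes so that latent size grows sublinearly with resolution, can be illustrated with a minimal sketch. The pooling-based projection and the specific choice of four planes below are assumptions for illustration only; the paper's actual encoder uses learned projection layers.

```python
import numpy as np

def four_plane_project(video):
    """Illustrative projection of a (C, T, H, W) volume onto factorized planes
    via mean pooling (a stand-in for the paper's learned projections)."""
    planes = [
        video.mean(axis=1),       # (C, H, W): spatial plane, time pooled out
        video.mean(axis=2),       # (C, T, W): time-width plane
        video.mean(axis=3),       # (C, T, H): time-height plane
        video.mean(axis=(2, 3)),  # (C, T): hypothetical fourth factor
    ]
    return planes

video = np.random.rand(4, 16, 64, 64)   # toy clip: 4 channels, 16 frames, 64x64
planes = four_plane_project(video)

volume_size = video.size                 # full volume: O(T*H*W) elements
plane_size = sum(p.size for p in planes) # planes: O(H*W + T*W + T*H + T)
print(volume_size, plane_size)
```

For this toy clip the full volume holds 262,144 elements while the planes together hold 24,640, and the gap widens as resolution grows: doubling T, H, and W multiplies the volume by 8 but the planes by only 4, which is the sublinear scaling the summary refers to.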

📝 Abstract
Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high resolution data into a compressed lower dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion models (LDMs), such as class-conditional generation, frame prediction, and video interpolation. Our results show that the proposed four-plane latent space retains a rich representation needed for high-fidelity reconstructions despite the heavy compression, while simultaneously enabling LDMs to operate with significant improvements in speed and memory.
Problem

Research questions and friction points this paper is trying to address.

Efficient generative modeling for high-dimensional video data
Challenges in training latent variable models for videos
Designing autoencoders for compressed yet rich video representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Four-plane factorized latent space for videos
Latent size that grows sublinearly with input resolution
Efficient integration with latent diffusion models (LDMs)