Waver: Wave Your Way to Lifelike Video Generation

📅 2025-08-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work introduces the first unified image-video foundation model capable of native 720p, 5–10-second high-fidelity video generation, supporting text-to-image (T2I), text-to-video (T2V), and image-to-video (I2V) tasks. To address challenges in cross-modal alignment, training efficiency, and video quality control, the authors propose: (1) a Hybrid Stream DiT architecture that enables streaming cross-modal modeling to improve text-image-video alignment and accelerate convergence; (2) an MLLM-driven multi-stage data curation pipeline for automated video quality assessment and human-in-the-loop filtering; and (3) fine-grained training strategies and efficient inference mechanisms. On the Artificial Analysis benchmark (as of July 30, 2025), the model ranks among the top three for both T2V and I2V, significantly outperforming existing open-source models and matching or exceeding leading commercial solutions.
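The summary names a Hybrid Stream DiT but does not spell out its internals. As a rough illustration only, the sketch below shows the general dual-stream-then-single-stream pattern common to hybrid DiT designs, where text and video tokens first pass through modality-specific layers and are then concatenated for shared cross-modal layers. Every name, the layer counts, and the toy "layer" arithmetic here are hypothetical stand-ins, not the paper's actual architecture.

```python
# Schematic sketch of a hybrid dual-stream -> single-stream forward pass.
# Tokens are toy 1-D feature lists; a "layer" is a stand-in transform.

def toy_layer(tokens, scale):
    """Stand-in for a transformer block: scales every feature."""
    return [[f * scale for f in tok] for tok in tokens]

def hybrid_stream_forward(text_tokens, video_tokens,
                          n_dual_layers=2, n_shared_layers=2):
    # Stage 1: dual streams with modality-specific parameters.
    for _ in range(n_dual_layers):
        text_tokens = toy_layer(text_tokens, scale=1.1)
        video_tokens = toy_layer(video_tokens, scale=0.9)
    # Stage 2: concatenate along the sequence axis and run shared
    # layers, where cross-modal attention would mix both modalities.
    merged = text_tokens + video_tokens
    for _ in range(n_shared_layers):
        merged = toy_layer(merged, scale=1.0)
    return merged

out = hybrid_stream_forward([[1.0, 2.0]], [[3.0], [4.0]])
print(len(out))  # 3 tokens: 1 text + 2 video
```

The point of the pattern is that early modality-specific layers let each stream specialize before the shared layers enforce alignment, which is consistent with the summary's claim of faster convergence and better text-image-video alignment.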

๐Ÿ“ Abstract
We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
Problem

Research questions and friction points this paper is trying to address.

Unified framework for text-to-video, image-to-video, and text-to-image generation
Generating high-quality 720p videos with complex motion and temporal consistency
Improving modality alignment and training convergence for video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Stream DiT architecture for modality alignment
MLLM-based video quality model for data filtering
Unified framework for T2V, I2V, and T2I generation
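The MLLM-based quality model above is described only at a high level: clips are scored automatically and filtered with human-in-the-loop thresholds. A minimal sketch of that filtering step, assuming each clip carries a model-assigned score, is below. The record fields, score scale, and threshold are hypothetical placeholders, not values from the paper.

```python
# Sketch of MLLM-driven quality filtering: keep only clips whose
# model-assigned quality score clears a cutoff. Field names and the
# threshold are illustrative assumptions.

def filter_clips(clips, min_score=0.8):
    """Return the clips whose quality score passes the cutoff."""
    return [c for c in clips if c["quality_score"] >= min_score]

clips = [
    {"id": "a", "quality_score": 0.95},  # high quality: kept
    {"id": "b", "quality_score": 0.40},  # low quality: dropped
    {"id": "c", "quality_score": 0.85},  # borderline: kept
]
kept = filter_clips(clips)
print([c["id"] for c in kept])  # ['a', 'c']
```

In a multi-stage pipeline like the one described, a filter of this shape would typically run per stage with progressively stricter thresholds, with human annotation used to calibrate the scorer rather than to label every clip.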