🤖 AI Summary
This work addresses the high computational cost of existing video diffusion models, which hinders real-time interactive digital human generation. The authors propose a streaming-oriented framework for talking-head video synthesis that leverages audio and reference image conditioning. Central to their approach is a reference-guided causal video VAE for efficient latent-space compression, augmented with a residual autoencoder architecture to strengthen spatiotemporal causality. Building upon this representation, they introduce a chunk-wise autoregressive latent denoising mechanism coupled with a Rectified Flow Transformer to enable high-speed, high-fidelity video streaming. The method achieves significantly faster inference than baseline models while matching or surpassing state-of-the-art large diffusion models in terms of realism, expressiveness, and overall video quality.
📝 Abstract
Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed specifically for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby improving both compression efficiency and reconstruction quality. Additionally, we extend the residual autoencoding paradigm to better handle spatiotemporal causality in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise autoregressive manner. Our method enables real-time generation of high-quality talking portrait videos at speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it matches or even outperforms large video diffusion models in realism, vividness, and video quality.
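To make the blockwise autoregressive generation concrete, the following is a minimal sketch of how such a sampling loop could look. It is not the authors' implementation: `toy_velocity` is a hypothetical closed-form stand-in for the Rectified Flow Transformer, and the chunk size, step count, latent shapes, and the use of the previous chunk's last frame as context are all assumptions made for illustration. Each chunk of latents starts from Gaussian noise, is integrated along the rectified-flow ODE with Euler steps, and then conditions the next chunk.

```python
import numpy as np


def toy_velocity(x, t, audio_chunk, context):
    """Hypothetical stand-in for the learned velocity network.

    Rectified flow uses straight paths x_t = (1 - t) * noise + t * data,
    so the velocity pointing at a known target is (target - x) / (1 - t).
    Here the "clean latent" target is a toy function of the conditions.
    """
    target = audio_chunk + 0.5 * context  # assumed toy conditioning rule
    return (target - x) / (1.0 - t)


def generate_stream(audio_features, frames_per_chunk=4, num_steps=8, seed=0):
    """Yield latent chunks one at a time, each conditioned on the previous one."""
    rng = np.random.default_rng(seed)
    latent_dim = audio_features.shape[1]
    context = np.zeros(latent_dim)  # no history before the first chunk
    dt = 1.0 / num_steps

    for start in range(0, len(audio_features), frames_per_chunk):
        audio_chunk = audio_features[start:start + frames_per_chunk]
        x = rng.standard_normal(audio_chunk.shape)  # start from pure noise

        # Euler integration of dx/dt = v(x, t) from t = 0 towards t = 1.
        for k in range(num_steps):
            t = k * dt  # t stays below 1, so (1 - t) is never zero
            x = x + dt * toy_velocity(x, t, audio_chunk, context)

        context = x[-1]  # causal context: last frame of the finished chunk
        yield x  # the chunk can be decoded and displayed immediately


if __name__ == "__main__":
    audio = np.ones((8, 3))  # 8 frames of 3-dimensional toy audio features
    for i, chunk in enumerate(generate_stream(audio)):
        print(i, chunk.shape)
```

Because each chunk is yielded as soon as its denoising loop finishes, a downstream decoder can start producing frames before the full sequence exists, which is the property that makes this style of generation suitable for streaming.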