HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

📅 2025-05-07
🤖 AI Summary
Existing customized video generation methods suffer from poor identity consistency and limited input modality support. This paper proposes a subject-specific, multimodal controllable video generation framework accepting images, text, audio, and video as conditional inputs. Our contributions are threefold: (1) a LLaVA-based text–image fusion module coupled with an image ID enhancement mechanism to ensure subject identity consistency; (2) an AudioNet-based hierarchical audio–video alignment module and a video-driven patchify feature alignment network to strengthen cross-modal temporal modeling; and (3) a unified architecture integrating latent-space video compression, spatiotemporal feature concatenation, spatial cross-attention, and conditional injection. Experiments demonstrate state-of-the-art performance across both single- and multi-subject scenarios, with significant improvements in identity consistency, visual fidelity, and alignment accuracy with respect to text, audio, and video modalities. Robustness of multimodal-driven generation is also empirically validated.
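The image ID enhancement mechanism mentioned above works by temporal concatenation: the reference-image latent is prepended as an extra pseudo-frame so the video backbone can attend to identity features along the time axis. A minimal sketch of that data layout, using numpy arrays as stand-ins for the latents (the function name and shapes are illustrative, not the paper's actual API):

```python
import numpy as np

def concat_identity_frame(video_latents: np.ndarray, id_latent: np.ndarray) -> np.ndarray:
    """Prepend the reference-image latent as a pseudo-frame so temporal
    attention can propagate identity features across all video frames.

    video_latents: (T, C, H, W) latent-space-compressed video frames
    id_latent:     (C, H, W)    latent of the reference subject image
    """
    assert video_latents.shape[1:] == id_latent.shape, "spatial/channel dims must match"
    # Concatenate along the time axis: result is (T+1, C, H, W)
    return np.concatenate([id_latent[None, ...], video_latents], axis=0)
```

In a real pipeline the extra identity frame would be dropped or masked at decode time; the sketch only illustrates how temporal concatenation exposes identity features to the temporal attention layers.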

📝 Abstract
Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via spatial cross-attention, and a video-driven injection module that integrates latent-compressed conditional video through a patchify-based feature-alignment network. Extensive experiments on single- and multi-subject scenarios demonstrate that HunyuanCustom significantly outperforms state-of-the-art open- and closed-source methods in terms of ID consistency, realism, and text-video alignment. Moreover, we validate its robustness across downstream tasks, including audio and video-driven customized video generation. Our results highlight the effectiveness of multi-modal conditioning and identity-preserving strategies in advancing controllable video generation. All the code and models are available at https://hunyuancustom.github.io.
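The abstract's AudioNet module injects audio features into the video stream via spatial cross-attention: spatial video tokens of a frame act as queries against the audio features aligned to that frame. A minimal single-head sketch of this injection pattern, with illustrative shapes and names (not the paper's implementation):

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_cross_attention(video_tokens, audio_tokens, Wq, Wk, Wv):
    """Inject per-frame audio features into spatial video tokens.

    video_tokens: (N, d) spatial tokens of one frame (queries)
    audio_tokens: (M, d) audio features aligned to that frame (keys/values)
    Wq, Wk, Wv:   (d, d) projection matrices
    """
    q, k, v = video_tokens @ Wq, audio_tokens @ Wk, audio_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    # Residual injection preserves the video token shape
    return video_tokens + attn @ v
```

Hierarchical alignment in the paper applies this at multiple feature levels; the sketch shows one level to make the query/key roles concrete.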
Problem

Research questions and friction points this paper is trying to address.

Enhancing identity consistency in customized video generation
Supporting multi-modal inputs like image, audio, video, text
Improving realism and text-video alignment in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-image fusion module for multi-modal understanding
Image ID enhancement via temporal concatenation
Modality-specific condition injection mechanisms
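For the video-driven injection path, the conditional video is first compressed into latents and then patchified into a token sequence that a feature-alignment network projects into the backbone's token space. A minimal sketch of the patchify-then-align step, with hypothetical names and shapes:

```python
import numpy as np

def patchify(latent: np.ndarray, p: int) -> np.ndarray:
    """Split a (C, H, W) conditional-video latent into non-overlapping
    p x p patches and flatten each into a token: (H//p * W//p, C*p*p)."""
    C, H, W = latent.shape
    assert H % p == 0 and W % p == 0, "latent dims must be divisible by patch size"
    x = latent.reshape(C, H // p, p, W // p, p)
    x = x.transpose(1, 3, 0, 2, 4)                    # (H//p, W//p, C, p, p)
    return x.reshape((H // p) * (W // p), C * p * p)

def align_features(tokens: np.ndarray, W_align: np.ndarray) -> np.ndarray:
    """Linear feature alignment: project condition tokens into the
    video backbone's token width (stand-in for the alignment network)."""
    return tokens @ W_align
```

A learned alignment network would replace the single linear projection; the sketch only shows how the compressed condition video becomes a token sequence compatible with the generator.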