DirectorLLM for Human-Centric Video Generation

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 5
✨ Influential: 1
🤖 AI Summary
To address insufficient human motion realism and interactivity in video generation, this paper proposes DirectorLLM, a paradigm that casts a large language model (LLM) as a "video director." Rather than serving solely as a text generator, the LLM acts as a central orchestrator for human motion modeling and video composition, decoupling motion synthesis from rendering and enabling plug-and-play integration with diverse architectures (e.g., UNet, DiT). Built on a fine-tuned Llama 3, DirectorLLM combines instruction-driven pose signal generation, conditional video rendering, and multimodal alignment. Extensive experiments show that DirectorLLM outperforms state-of-the-art text-to-video methods in human motion fidelity, prompt adherence, and subject naturalness, with both automated metrics and human evaluations consistently validating its effectiveness.

๐Ÿ“ Abstract
In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
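The abstract describes a two-stage pipeline: the LLM first simulates human motion as instructional pose signals, and a separate conditional renderer (UNet or DiT) then turns the prompt plus those signals into frames. A minimal sketch of that data flow, where every class and function name (`PoseFrame`, `director_llm`, `render_video`) is an illustrative assumption rather than the paper's actual API:

```python
# Hypothetical sketch of the DirectorLLM pipeline: LLM-generated pose
# outlines conditioning a pluggable video renderer. Names are assumed.
from dataclasses import dataclass
from typing import List


@dataclass
class PoseFrame:
    """One frame of human pose keypoints, flattened as (x, y) pairs."""
    keypoints: List[float]


def director_llm(prompt: str, num_frames: int) -> List[PoseFrame]:
    """Stand-in for the fine-tuned Llama 3 'director': maps a text prompt
    to a per-frame pose outline. Emits a trivial placeholder trajectory
    here, only to show the interface."""
    return [PoseFrame(keypoints=[t / num_frames] * 34) for t in range(num_frames)]


def render_video(prompt: str, poses: List[PoseFrame], backbone: str = "DiT") -> List[str]:
    """Stand-in for a conditional video renderer (UNet or DiT): consumes
    the prompt plus pose conditions. Returns frame labels, not pixels."""
    return [f"{backbone}-frame-{i}" for i, _ in enumerate(poses)]


prompt = "a person waves and walks to the left"
poses = director_llm(prompt, num_frames=16)   # motion simulated by the LLM first
frames = render_video(prompt, poses)          # renderer conditions on the poses
```

Because the director and the renderer communicate only through the pose signals, swapping `backbone="UNet"` for `backbone="DiT"` requires no change to the LLM module, which is the plug-and-play property the paper claims.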
Problem

Research questions and friction points this paper is trying to address.

Generating realistic human motion in videos
Enhancing authenticity of human interactions
Improving prompt-following video generation fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM orchestrates human poses in videos
Generates detailed instructional signals for motion
Compatible with various video renderers like UNet