Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing generative video models struggle to provide human-centric camera control that follows cinematic language: they often produce random camera trajectories, spatial inconsistencies, and insufficient focus on the human subject. This work proposes a human-centric camera parameterization that, for the first time, formalizes cinematic composition principles as computable, human-relative camera parameters. It introduces a domain-specific language (DSL) and pairs it with a multimodal large language model that maps natural language instructions and human motion to cinematic keyframe shots; deterministic interpolation then turns these keyframes into smooth, continuous camera trajectories. Evaluated on a newly curated dataset of 34K aligned text–motion–camera samples, the approach consistently outperforms existing methods on composition-oriented metrics, enabling controllable, cinematically framed human-centric video generation.
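The keyframe-then-interpolate step described in the summary can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Keyframe` fields (`distance`, `azimuth_deg`, `elevation_deg`) are hypothetical stand-ins for the DSL's human-relative parameters, and linear interpolation is an assumption, since the summary only states that the interpolation is deterministic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Keyframe:
    # All fields are hypothetical; the actual DSL is not specified in this summary.
    frame: int            # frame index at which this framing applies
    distance: float       # camera-to-subject distance
    azimuth_deg: float    # horizontal angle around the subject
    elevation_deg: float  # vertical angle relative to the subject


def lerp(a: float, b: float, t: float) -> float:
    """Linear blend between a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate(keyframes: List[Keyframe]) -> List[Keyframe]:
    """Expand sparse keyframes into one framing per frame (assumed linear)."""
    keys = sorted(keyframes, key=lambda k: k.frame)
    if not keys:
        return []
    dense = []
    for k0, k1 in zip(keys, keys[1:]):
        span = k1.frame - k0.frame
        # range() is empty when two keyframes share a frame, so span is never 0 here
        for f in range(k0.frame, k1.frame):
            t = (f - k0.frame) / span
            dense.append(Keyframe(
                f,
                lerp(k0.distance, k1.distance, t),
                lerp(k0.azimuth_deg, k1.azimuth_deg, t),
                lerp(k0.elevation_deg, k1.elevation_deg, t),
            ))
    dense.append(keys[-1])  # the final keyframe closes the trajectory
    return dense
```

The same keyframes always yield the same trajectory, which is the property the summary emphasizes. Note that this sketch blends angles naively and does not handle azimuth wrap-around at 360 degrees.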

📝 Abstract

Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods
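To make the idea of a framing "defined relative to the actor" and "convertible to standard 6-DoF camera parameters" concrete, here is a small sketch of one possible conversion. It is an illustration under stated assumptions rather than the Auteur DSL: the parameters `distance`, `azimuth_deg`, and `elevation_deg` are hypothetical, the world is assumed right-handed with the y axis up, and the camera is assumed to look straight at the subject point.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def camera_pose(subject, distance, azimuth_deg, elevation_deg):
    """Convert a subject-relative framing into a world-space camera pose.

    Returns (position, (right, up, forward)): 3 translational degrees of
    freedom plus an orthonormal basis encoding the 3 rotational ones.
    All parameter names are hypothetical, not taken from the paper.
    """
    if distance <= 0 or abs(elevation_deg) >= 90:
        # Straight-up/down views make the look-at basis degenerate in this sketch.
        raise ValueError("need distance > 0 and |elevation_deg| < 90")
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Place the camera on a sphere centered on the subject.
    offset = (distance * math.cos(el) * math.sin(az),
              distance * math.sin(el),
              distance * math.cos(el) * math.cos(az))
    position = tuple(s + o for s, o in zip(subject, offset))
    # Aim the camera at the subject (look-at construction, y-up world).
    forward = _normalize(tuple(s - p for s, p in zip(subject, position)))
    right = _normalize(_cross(forward, (0.0, 1.0, 0.0)))
    up = _cross(right, forward)
    return position, (right, up, forward)
```

For example, `camera_pose((0.0, 1.0, 0.0), 2.0, 0.0, 0.0)` places the camera two units in front of the subject along +z, looking back along -z. Because the pose is recomputed from the subject position, the framing follows the actor as they move, which is the intuition the abstract attributes to professional filmmakers.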

Problem

Research questions and friction points this paper is trying to address.

camera control

human-centric video generation

cinematographic framing

generative video models

intentional camera motion

Innovation

Methods, ideas, or system contributions that make the work stand out.

human-centric camera framing

Domain-Specific Language (DSL)

language-driven cinematography