Agentic Aerial Cinematography: From Dialogue Cues to Cinematic Trajectories

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing drone-based indoor cinematography relies heavily on manual waypoint and viewpoint specification, resulting in labor-intensive workflows and inconsistent aesthetic quality. To address this, we propose the first natural language–guided autonomous indoor aerial cinematography system. Our method integrates large language models (LLMs) and vision foundation models (VFMs) into a closed-loop pipeline: “instruction parsing → trajectory generation → aesthetics-driven optimization → safe execution.” It employs vision-language retrieval to interpret open-vocabulary instructions, preference-based Bayesian optimization for aesthetic pose refinement, and safety-aware quadrotor motion planning to generate executable trajectories. Evaluated in simulation and hardware-in-the-loop experiments across diverse indoor environments, the system consistently produces professional-grade video sequences that accurately follow free-form natural language commands. Crucially, it significantly reduces reliance on domain expertise in robotics and cinematic production.
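The vision-language retrieval step described above can be pictured as a similarity search between the director's prompt and captions attached to candidate camera poses. The sketch below is purely illustrative and assumes nothing from the paper's actual implementation: the `embed` function is a toy bag-of-words hash embedding standing in for a real vision-language encoder (e.g., CLIP), and the candidate waypoints are invented.

```python
import zlib
import numpy as np

def embed(text, dim=256):
    # Toy bag-of-words embedding: each word maps to a fixed random
    # vector seeded by its CRC-32. A deterministic stand-in for a
    # real vision-language encoder (e.g., CLIP); illustrative only.
    vecs = [np.random.default_rng(zlib.crc32(w.encode())).normal(size=dim)
            for w in text.lower().split()]
    v = np.sum(vecs, axis=0)
    return v / np.linalg.norm(v)

def select_waypoints(prompt, candidates, k=2):
    # Rank candidate waypoints (a pose plus a caption of what the
    # camera sees from it) by cosine similarity to the prompt.
    q = embed(prompt)
    ranked = sorted(candidates,
                    key=lambda c: float(q @ embed(c["caption"])),
                    reverse=True)
    return ranked[:k]

# Hypothetical candidates; in the paper these would come from the VFM.
candidates = [
    {"pose": (1.0, 0.5, 1.8), "caption": "bookshelf in the study"},
    {"pose": (4.2, 2.0, 1.5), "caption": "sunlit kitchen counter"},
    {"pose": (2.5, 3.1, 2.0), "caption": "piano near the window"},
]
best = select_waypoints("slow shot of the piano by the window",
                        candidates, k=1)
print(best[0]["caption"])
```

Because both embeddings are unit-normalized, the dot product is cosine similarity, so open-vocabulary prompts rank candidate views without any fixed label set.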

📝 Abstract
We present Agentic Aerial Cinematography: From Dialogue Cues to Cinematic Trajectories (ACDC), an autonomous drone cinematography system driven by natural language communication between human directors and drones. The main limitation of previous drone cinematography workflows is that they require manual selection of waypoints and view angles based on predefined human intent, which is labor-intensive and yields inconsistent performance. In this paper, we propose employing large language models (LLMs) and vision foundation models (VFMs) to convert free-form natural language prompts directly into executable indoor UAV video tours. Specifically, our method comprises a vision-language retrieval pipeline for initial waypoint selection, a preference-based Bayesian optimization framework that refines poses using aesthetic feedback, and a motion planner that generates safe quadrotor trajectories. We validate ACDC through both simulation and hardware-in-the-loop experiments, demonstrating that it robustly produces professional-quality footage across diverse indoor scenes without requiring expertise in robotics or cinematography. These results highlight the potential of embodied AI agents to close the loop from open-vocabulary dialogue to real-world autonomous aerial cinematography.
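On the last component of the pipeline, a common way to turn discrete waypoints into smooth, dynamically gentle quadrotor motion is per-axis polynomial segments. The quintic below (zero velocity and acceleration at both endpoints) is a generic minimum-jerk-style building block, not the paper's specific planner.

```python
import numpy as np

def quintic_coeffs(p0, p1, T):
    # Quintic segment from p0 to p1 over duration T with zero velocity
    # and acceleration at both ends -- a standard smooth building block
    # for quadrotor trajectories (generic sketch, not the paper's
    # planner). Coefficients are highest-degree first for np.polyval.
    A = np.array([
        [0,       0,       0,      0,    0, 1],  # position at t=0
        [T**5,    T**4,    T**3,   T**2, T, 1],  # position at t=T
        [0,       0,       0,      0,    1, 0],  # velocity at t=0
        [5*T**4,  4*T**3,  3*T**2, 2*T,  1, 0],  # velocity at t=T
        [0,       0,       0,      2,    0, 0],  # acceleration at t=0
        [20*T**3, 12*T**2, 6*T,    2,    0, 0],  # acceleration at t=T
    ], dtype=float)
    b = np.array([p0, p1, 0, 0, 0, 0], dtype=float)
    return np.linalg.solve(A, b)

coeffs = quintic_coeffs(0.0, 2.0, 4.0)  # one axis: 0 m -> 2 m in 4 s
mid = float(np.polyval(coeffs, 2.0))    # symmetric profile: 1.0 at T/2
```

Running one such segment per axis between consecutive retrieved waypoints yields a continuous, low-jerk camera path; a safety-aware planner like the paper's would additionally enforce obstacle clearance and actuator limits.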
Problem

Research questions and friction points this paper is trying to address.

Autonomous drone cinematography using natural language
Converting free-form language prompts to UAV trajectories
Generating professional aerial footage without manual expertise
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs and VFMs convert prompts to UAV tours
Bayesian optimization refines poses with feedback
Motion planner generates safe quadrotor trajectories
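The aesthetic refinement step can be illustrated with a bare-bones Bayesian optimization loop. This is a sketch under strong simplifying assumptions: the paper refines poses from pairwise aesthetic preferences, whereas here a fixed scalar `aesthetic_score` (hypothetical) stands in for that feedback, and the search is over a single yaw-offset parameter with a Gaussian-process surrogate and UCB acquisition.

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # RBF kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def aesthetic_score(yaw):
    # Hypothetical stand-in for aesthetic feedback: the paper learns
    # this from pairwise human preferences; a fixed function keeps
    # the sketch self-contained. Peak quality at yaw = 0.7.
    return float(np.exp(-((yaw - 0.7) ** 2) / 0.05))

def bayes_opt(n_iter=15, noise=1e-6, beta=2.0):
    grid = np.linspace(0.0, 1.0, 201)   # candidate yaw offsets
    X = np.array([0.0, 1.0])            # initial probes at the bounds
    y = np.array([aesthetic_score(x) for x in X])
    for _ in range(n_iter):
        # GP posterior mean and variance on the grid.
        K = rbf(X, X) + noise * np.eye(len(X))
        Ks = rbf(grid, X)
        Kinv = np.linalg.inv(K)
        mu = Ks @ Kinv @ y
        var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)
        # Upper-confidence-bound acquisition: explore where uncertain,
        # exploit where the predicted score is high.
        ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))
        x_new = grid[int(np.argmax(ucb))]
        X = np.append(X, x_new)
        y = np.append(y, aesthetic_score(x_new))
    return float(X[int(np.argmax(y))])

best_yaw = bayes_opt()
```

Each iteration queries the (simulated) critic only once, mirroring how sample-efficient the real system must be when every query means rendering or flying a candidate shot.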
👥 Authors
Yifan Lin (Google)
Sophie Ziyu Liu (Division of Engineering Science, University of Toronto)
Ran Qi (Division of Engineering Science, University of Toronto)
George Z. Xue (Division of Engineering Science, University of Toronto)
Xinping Song (Division of Engineering Science, University of Toronto)
Chao Qin (Institute for Aerospace Studies, University of Toronto)
Hugh H.-T. Liu (Institute for Aerospace Studies, University of Toronto)