PresentAgent: Multimodal Agent for Presentation Video Generation

📅 2025-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatically generating human-like, audiovisually synchronized presentation videos from long documents. We propose an end-to-end multimodal generation framework built as a modular pipeline: large language models perform segment-wise document understanding and content distillation, while text-to-speech, visual generation, and audiovisual alignment modules jointly produce narration-style videos with precise image-text matching, semantic coherence, and temporal accuracy. To enable rigorous evaluation, we introduce PresentEval, a unified benchmark that quantitatively assesses content fidelity, visual clarity, and audience comprehension, the first such framework for presentation video generation. Experiments on 30 real-world document–presentation pairs show that the generated videos approach human-produced quality across multiple metrics, significantly advancing the conversion of lengthy textual content into accessible, engaging dynamic presentations.

📝 Abstract
We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at https://github.com/AIGeeksGroup/PresentAgent.
Problem

Research questions and friction points this paper is trying to address.

Transforms long documents into narrated presentation videos
Generates synchronized visual and spoken content like humans
Evaluates videos for content fidelity and audience comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular pipeline for synchronized audiovisual content
LLM and TTS for contextual spoken narration
Vision-Language Model for unified video assessment
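The modular pipeline described above (segment the document, plan slide frames, generate narration, compose with audio-visual alignment) can be sketched as plain Python. This is a minimal illustrative stand-in, not the authors' implementation: `segment_document`, `plan_slide`, and `synthesize_timing` are hypothetical names, and the LLM/TTS calls are replaced by trivial heuristics (paragraph splitting, a words-per-second duration estimate) to show only the data flow and timeline alignment.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    title: str
    text: str

@dataclass
class Slide:
    title: str
    bullets: list
    narration: str
    duration_s: float = 0.0

def segment_document(doc: str) -> list:
    """Split a document into paragraph segments (stand-in for LLM semantic segmentation)."""
    parts = [p.strip() for p in doc.split("\n\n") if p.strip()]
    return [Segment(title=p.split(".")[0][:60], text=p) for p in parts]

def plan_slide(seg: Segment) -> Slide:
    """Distil a segment into slide bullets and a narration script (stand-in for LLM calls)."""
    bullets = [s.strip() for s in seg.text.split(".") if s.strip()][:3]
    # A real system would rewrite the text into spoken style before TTS.
    return Slide(title=seg.title, bullets=bullets, narration=seg.text)

def synthesize_timing(slide: Slide, words_per_second: float = 2.5) -> Slide:
    """Estimate audio duration from narration length (a TTS engine returns the true duration)."""
    slide.duration_s = len(slide.narration.split()) / words_per_second
    return slide

def compose_timeline(slides):
    """Place slides back-to-back on a shared timeline so visuals stay aligned with audio."""
    t, timeline = 0.0, []
    for s in slides:
        timeline.append((t, t + s.duration_s, s.title))
        t += s.duration_s
    return timeline

doc = ("PresentAgent overview. It converts documents into videos.\n\n"
       "Pipeline details. Segmentation, slides, narration, composition.")
slides = [synthesize_timing(plan_slide(seg)) for seg in segment_document(doc)]
timeline = compose_timeline(slides)
```

In the paper's full system, each stub corresponds to a model call (LLM for segmentation and narration, a rendering module for frames, TTS for audio); the timeline composition step is what enforces the precise audio-visual alignment the abstract emphasizes.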
Jingwei Shi
Shanghai University of Finance and Economics
Deep Learning · LLM · MLLM · Agent
Zeyu Zhang
AI Geeks, Australia
Biao Wu
Australian Artificial Intelligence Institute, Australia
Yanjie Liang
AI Geeks, Australia
Meng Fang
University of Liverpool
Natural Language Processing · Reinforcement Learning · Agents · Artificial Intelligence
Ling Chen
Australian Artificial Intelligence Institute, Australia
Yang Zhao
La Trobe University, Australia