🤖 AI Summary
Current video generation models face critical limitations when converting academic papers into structured video summaries: narrow context windows, restricted output duration, monolithic stylistic outputs, and imprecise domain-knowledge representation. To address these challenges, we propose the first agent-based system designed specifically for the “paper-to-video” task. It follows a two-stage paradigm that combines top-down content decomposition with bottom-up clip generation. The approach introduces key-scene definition and Progressive Chain-of-Thought (P-CoT) reasoning to enable fine-grained cross-modal alignment and accurate modeling of domain-specific knowledge. The system unifies large language model–driven reasoning, multi-granularity summarization, controllable video generation, and compositional synthesis, supporting end-to-end task planning and content orchestration. Evaluated across five academic disciplines, the generated video summaries show statistically significant improvements over baselines in domain expertise, narrative coherence, and stylistic diversity.
📝 Abstract
The paper-to-video task converts a research paper into a structured video abstract, distilling key concepts, methods, and conclusions into an accessible, well-organized format. While state-of-the-art video generation models demonstrate potential, they are constrained by limited context windows, rigid video duration constraints, limited stylistic diversity, and an inability to represent domain-specific knowledge. To address these limitations, we introduce Preacher, the first paper-to-video agentic system. Preacher employs a top-down approach to decompose, summarize, and reformulate the paper, followed by bottom-up video generation, synthesizing diverse video segments into a coherent abstract. To align cross-modal representations, we define key scenes and introduce a Progressive Chain of Thought (P-CoT) for granular, iterative planning. Preacher successfully generates high-quality video abstracts across five research fields, demonstrating expertise beyond current video generation models. Code will be released at: https://github.com/Gen-Verse/Paper2Video
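The two-stage flow described in the abstract (top-down decomposition and planning, then bottom-up clip generation and assembly) can be pictured with a toy sketch. This is not Preacher's implementation: every name here (`KeyScene`, `decompose`, `summarize`, `plan_scenes`, `generate_clip`, `paper_to_video`) and the style labels are hypothetical, and the LLM and video-generation calls are replaced by trivial string operations. The only point it illustrates is the control flow, including planning each scene with earlier scenes in view, which is one loose reading of "progressive" planning.

```python
from dataclasses import dataclass


@dataclass
class KeyScene:
    """One planned unit of the video abstract (all fields are illustrative)."""
    title: str
    narration: str
    style: str  # "slides" or "animation"; these style names are assumptions


def decompose(paper_text: str) -> list[str]:
    """Top-down step: split the paper into blocks (here, blank-line paragraphs)."""
    return [block.strip() for block in paper_text.split("\n\n") if block.strip()]


def summarize(block: str, max_words: int = 12) -> str:
    """Stand-in for an LLM summarizer: keep only the first few words."""
    return " ".join(block.split()[:max_words])


def plan_scenes(blocks: list[str]) -> list[KeyScene]:
    """Plan scenes one at a time, each with the previously planned scenes in view."""
    scenes: list[KeyScene] = []
    for index, block in enumerate(blocks):
        # Toy rule that depends on prior context: alternate away from the last style.
        previous_style = scenes[-1].style if scenes else None
        style = "animation" if previous_style == "slides" else "slides"
        scenes.append(
            KeyScene(title=f"Scene {index + 1}", narration=summarize(block), style=style)
        )
    return scenes


def generate_clip(scene: KeyScene) -> str:
    """Stand-in for a video generator: return a text label instead of a video."""
    return f"[{scene.style}] {scene.title}: {scene.narration}"


def paper_to_video(paper_text: str) -> list[str]:
    """Top-down planning followed by bottom-up clip generation and assembly."""
    scenes = plan_scenes(decompose(paper_text))
    return [generate_clip(scene) for scene in scenes]


if __name__ == "__main__":
    for clip in paper_to_video("We study X.\n\nOur method does Y.\n\nResults show Z."):
        print(clip)
```

In a real system, `summarize` and `plan_scenes` would presumably be backed by a language model and `generate_clip` by one or more video generators; the sketch leaves those choices open.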