ViMax: Agentic Video Generation

📅 2026-06-02
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing long-form video generation methods lack systematic narrative planning and struggle to keep characters and environments consistent across scenes. This work proposes the first multi-agent collaborative framework for long video generation, in which specialized agents jointly negotiate narrative structure, visual continuity, and production quality. The approach combines a hierarchical narrative engine, supported by retrieval-augmented generation, with a dependency-aware mechanism that tracks character and environment states over time; agents guided by a vision-language model (VLM) monitor and refine the output. The method is reported to improve narrative coherence and cross-scene visual consistency relative to current short-clip generation techniques, in both story structure and visual fidelity.
📝 Abstract
Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that approaches video creation as coordinated multi-agent collaboration, in which specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. Through this coordinated agent collaboration, the framework generates extended narrative content that maintains both storytelling integrity and visual coherence across multi-scene timelines.
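The abstract does not spell out how the dependency-aware consistency mechanism works, so the following is only a minimal sketch of one way such a dependency could be represented, not the ViMax implementation. It assumes that each shot depends on the most recent earlier shot showing the same character or the same environment, and that shots are generated in an order that respects those dependencies. The names `Shot`, `build_dependencies`, and `generation_order` are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Shot:
    """One shot in a multi-scene timeline (hypothetical structure)."""
    shot_id: str
    characters: set[str] = field(default_factory=set)
    environment: str = ""


def build_dependencies(shots: list[Shot]) -> dict[str, set[str]]:
    """Map each shot to the earlier shots whose visuals it should reuse.

    Assumption (not from the paper): a shot depends on the most recent
    earlier shot that showed the same character, and on the most recent
    earlier shot set in the same environment.
    """
    last_seen: dict[str, str] = {}  # entity key -> id of latest shot showing it
    deps: dict[str, set[str]] = {}
    for shot in shots:
        keys = [f"char:{name}" for name in sorted(shot.characters)]
        if shot.environment:
            keys.append(f"env:{shot.environment}")
        # Depend on whichever earlier shot last established each entity.
        deps[shot.shot_id] = {last_seen[k] for k in keys if k in last_seen}
        # This shot now becomes the latest reference for its entities.
        for k in keys:
            last_seen[k] = shot.shot_id
    return deps


def generation_order(shots: list[Shot]) -> list[str]:
    """Return an order in which every shot follows the shots it depends on."""
    return list(TopologicalSorter(build_dependencies(shots)).static_order())
```

With three shots, where the third reunites two characters in the first shot's location, `build_dependencies` links the third shot to both earlier ones, and `generation_order` places it after them. Shots with no shared characters or environments have no dependencies on each other, so under this assumption they could be generated independently.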
Problem

Research questions and friction points this paper is trying to address.

long-form video generation
narrative planning
visual consistency
character consistency
environmental consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic video generation
narrative coherence
visual consistency
retrieval-augmented generation
multi-agent collaboration