MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

📅 2025-12-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Despite significant advances in photorealism, text-to-video (T2V) generation remains severely deficient in physical plausibility and motion-logical consistency. To address this, we propose MoReGen, a novel multi-agent collaborative framework centered on *motion reasoning*: it orchestrates large language models, physics simulation engines, and renderers via code-based prompts, enabling end-to-end action planning and video synthesis within a unified code space. We formalize *object trajectory consistency* as a quantitative metric for physical validity and introduce MoReSet, a benchmark comprising 1,275 finely annotated videos. Extensive experiments reveal that state-of-the-art T2V models exhibit weak physical consistency; in contrast, MoReGen substantially improves motion coherence and adherence to Newtonian mechanics. Our work establishes a new methodology for *reasoning-aware*, *reproducible*, and *physics-aligned* video generation, accompanied by the first systematic evaluation infrastructure for physical plausibility in T2V.

πŸ“ Abstract
While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.
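The abstract proposes object-trajectory correspondence as a direct metric of physical validity, but its exact formulation is not given on this page. A minimal sketch of such a metric, under the assumption that trajectories are per-frame 2-D object centers aligned between the generated video and the ground truth (function name and normalization are ours, not the paper's):

```python
import numpy as np

def trajectory_correspondence(pred, gt):
    """Mean per-frame Euclidean distance between a generated object's
    trajectory and the ground-truth trajectory (lower is better).

    pred, gt: sequences of (x, y) positions, one per frame, same length.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("trajectories must have matching shapes")
    # Per-frame displacement error, averaged over time.
    return float(np.linalg.norm(pred - gt, axis=1).mean())
```

A perfectly reproduced trajectory scores 0; in practice one would likely normalize by the scene scale and aggregate over all tracked objects.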
Problem

Research questions and friction points this paper is trying to address.

Generating physically accurate videos from text prompts
Evaluating physical validity in text-to-video synthesis
Addressing motion coherence in Newtonian phenomena videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent LLMs integrate physics simulators and renderers
Generates videos in a unified code space, enabling object-trajectory correspondence checks
Evaluates physical validity using a benchmark of Newtonian phenomena
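The innovation bullets above describe a pipeline in which LLM agents emit code that a physics simulator then executes before rendering. A drastically simplified, hypothetical sketch of the simulation stage (the names `MotionPlan` and `simulate` are ours; the paper's agents and simulator are not specified on this page), using explicit Euler integration of Newtonian projectile motion:

```python
from dataclasses import dataclass

@dataclass
class MotionPlan:
    """A toy stand-in for the code-form plan a planner agent might emit."""
    x0: float   # initial position (m)
    y0: float
    vx: float   # initial velocity (m/s)
    vy: float
    g: float = 9.81  # gravitational acceleration (m/s^2)

def simulate(plan, dt=0.1, steps=20):
    """Integrate the plan with explicit Euler; returns (x, y) per frame.

    The resulting trajectory would be handed to a renderer, and can be
    compared against ground truth for physical-validity evaluation.
    """
    x, y, vx, vy = plan.x0, plan.y0, plan.vx, plan.vy
    traj = []
    for _ in range(steps):
        traj.append((x, y))
        x += vx * dt
        y += vy * dt
        vy -= plan.g * dt  # only gravity acts on the object
    return traj
```

Because the trajectory is computed deterministically from code, the same prompt reproduces the same motion, which is the reproducibility property the summary emphasizes.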
🔎 Similar Papers
No similar papers found.