DynamicEval: Rethinking Evaluation for Dynamic Text-to-Video Synthesis

📅 2025-10-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing text-to-video (T2V) evaluation benchmarks suffer from two key limitations: inadequate modeling of camera motion and overreliance on model-level aggregate scores, which precludes fine-grained video-level assessment. To address this, we introduce the first T2V benchmark explicitly designed for dynamic shot evaluation, comprising 45K human annotations on video pairs to assess the spatiotemporal consistency of both background and foreground objects under motion. Our method introduces two novel metrics: (1) a background consistency metric that corrects for occlusions using object-level error maps; and (2) a foreground consistency metric that integrates point tracking with optical-flow analysis. Additionally, we propose interpretable error maps that make motion-smoothness evaluation more transparent. Experiments demonstrate that our metrics improve correlation with human preferences by more than 2 percentage points at both the video and model levels, advancing the accuracy and interpretability of dynamic T2V generation evaluation.
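The background consistency idea, correcting a warping-error map for occluded pixels before averaging, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper builds its error maps on the VBench motion smoothness metric, whereas this sketch uses Farneback optical flow with a forward-backward consistency check, and `fg_masks` (per-frame boolean foreground masks, e.g. from an off-the-shelf instance segmenter) is an assumed input.

```python
import cv2
import numpy as np

def occlusion_mask(flow_fw, flow_bw, thresh=1.5):
    """Forward-backward flow consistency check: pixels whose round-trip
    displacement exceeds `thresh` are treated as occluded/disoccluded."""
    h, w = flow_fw.shape[:2]
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    # Sample the backward flow at each pixel's forward-warped position.
    map_x = (gx + flow_fw[..., 0]).astype(np.float32)
    map_y = (gy + flow_fw[..., 1]).astype(np.float32)
    bw_at_fw = cv2.remap(flow_bw, map_x, map_y, cv2.INTER_LINEAR)
    roundtrip = np.linalg.norm(flow_fw + bw_at_fw, axis=-1)
    return roundtrip > thresh  # True = occluded

def background_consistency(frames, fg_masks):
    """Warp each frame to the next and measure photometric error,
    skipping occluded pixels and foreground-object pixels."""
    scores = []
    for t in range(len(frames) - 1):
        g0 = cv2.cvtColor(frames[t], cv2.COLOR_BGR2GRAY)
        g1 = cv2.cvtColor(frames[t + 1], cv2.COLOR_BGR2GRAY)
        fw = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        bw = cv2.calcOpticalFlowFarneback(g1, g0, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = g0.shape
        gx, gy = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (gx + fw[..., 0]).astype(np.float32)
        map_y = (gy + fw[..., 1]).astype(np.float32)
        warped = cv2.remap(g1, map_x, map_y, cv2.INTER_LINEAR)
        err = np.abs(warped.astype(np.float32) - g0.astype(np.float32))
        # Average only over visible background pixels.
        valid = ~occlusion_mask(fw, bw) & ~fg_masks[t]
        if valid.any():
            scores.append(err[valid].mean())
    return 1.0 - np.mean(scores) / 255.0  # higher = more consistent
```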

📝 Abstract
Existing text-to-video (T2V) evaluation benchmarks, such as VBench and EvalCrafter, suffer from two limitations. (i) They emphasize subject-centric prompts or static-camera scenes, leaving camera motion, which is essential for producing cinematic shots, and the behavior of existing metrics under dynamic motion largely unexplored. (ii) These benchmarks typically aggregate video-level scores into a single model-level score for ranking generative models. Such aggregation, however, overlooks video-level evaluation, which is vital for selecting the better video among the candidates generated for a given prompt. To address these gaps, we introduce DynamicEval, a benchmark consisting of systematically curated prompts emphasizing dynamic camera motion, paired with 45k human annotations on video pairs drawn from 3k videos generated by ten T2V models. DynamicEval evaluates two key dimensions of video quality: background scene consistency and foreground object consistency. For background scene consistency, we derive interpretable error maps based on the VBench motion smoothness metric. We observe that while the VBench motion smoothness metric shows promising alignment with human judgments, it fails in two cases: occlusions and disocclusions arising from camera motion and foreground object movement. Building on this, we propose a new background consistency metric that leverages object error maps to correct these two failure cases in a principled manner. Our second innovation is a foreground consistency metric that tracks points and their neighbors within each object instance to assess object fidelity. Extensive experiments demonstrate that our proposed metrics achieve stronger correlations with human preferences at both the video level and the model level (an improvement of more than 2 percentage points), establishing DynamicEval as a more comprehensive benchmark for evaluating T2V models under dynamic camera motion.
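The foreground metric tracks points and their neighbors within each object instance. A hypothetical sketch of that idea is below, assuming `tracks` and `visibility` come from an off-the-shelf point tracker (e.g., CoTracker); the neighbor-distance drift score is illustrative and not the paper's exact formulation.

```python
import numpy as np

def foreground_consistency(tracks, visibility, k=4):
    """Score how stably each tracked point keeps its distance to its k
    nearest neighbors (chosen in frame 0) over time.

    tracks: (T, N, 2) xy positions of N points on one object instance
    visibility: (T, N) boolean per-frame visibility flags
    Returns a score in (0, 1]; higher means a more coherent object.
    """
    T, N, _ = tracks.shape
    # Pairwise distances in the first frame define each point's neighbors.
    d0 = np.linalg.norm(tracks[0, :, None] - tracks[0, None, :], axis=-1)
    np.fill_diagonal(d0, np.inf)               # exclude self-matches
    nbr = np.argsort(d0, axis=1)[:, :k]        # (N, k) neighbor indices
    ref = np.take_along_axis(d0, nbr, axis=1)  # (N, k) reference distances
    drifts = []
    for t in range(1, T):
        dt = np.linalg.norm(tracks[t, :, None] - tracks[t, None, :], axis=-1)
        cur = np.take_along_axis(dt, nbr, axis=1)
        # Only count pairs where both endpoints are visible in frame t.
        vis = visibility[t][:, None] & visibility[t][nbr]
        if vis.any():
            drifts.append(np.abs(cur - ref)[vis].mean())
    if not drifts:
        return 1.0
    # Normalize drift by the object's own scale, then map to (0, 1].
    return float(np.exp(-np.mean(drifts) / (ref.mean() + 1e-6)))
```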
Problem

Research questions and friction points this paper is trying to address.

Addresses two limitations of existing text-to-video benchmarks: inadequate coverage of camera motion and reliance on model-level aggregate scores
Proposes metrics for background and foreground consistency under dynamic camera motion
Introduces DynamicEval with 45k human annotations for video- and model-level evaluation (a video-/model-level agreement sketch appears at the end of this card)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the DynamicEval benchmark with prompts emphasizing dynamic camera motion
Proposes a background consistency metric that uses object error maps to correct occlusion and disocclusion failures
Develops a foreground consistency metric that tracks points and their neighbors within each object instance
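For context, the video-level and model-level agreement with human preferences in which the paper reports its >2-point gains can be computed roughly as below. This sketch assumes per-video metric scores and pairwise human annotations as inputs; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def video_level_agreement(metric_scores, human_pairs):
    """Fraction of human-annotated pairs where the metric ranks the
    human-preferred video higher.
    metric_scores: dict video_id -> float
    human_pairs: iterable of (winner_id, loser_id) tuples."""
    wins = [metric_scores[w] > metric_scores[l] for w, l in human_pairs]
    return float(np.mean(wins))

def model_level_correlation(metric_by_model, human_by_model):
    """Spearman rank correlation between metric-based and human-based
    model rankings. Both arguments: dict model_name -> mean score."""
    models = sorted(metric_by_model)
    rho, _ = spearmanr([metric_by_model[m] for m in models],
                       [human_by_model[m] for m in models])
    return rho
```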