🤖 AI Summary
Existing text-to-video (T2V) diffusion models show significant deficiencies in adhering to physical commonsense and in modeling temporal dynamics, stemming from limited physical understanding and weak sequential modeling. Current remedies either rely on large-scale annotated datasets or require dedicated physics modules, and consequently generalize poorly. This paper proposes PhyT2V, a **data-agnostic, model-agnostic, LLM-guided prompt enhancement framework** that enables zero-shot, physically plausible video generation. The method combines chain-of-thought (CoT) and step-back reasoning, physics-rule-constrained prompt engineering, and the diffusion model's native prompt optimization. Crucially, it requires no additional training data or architectural modifications and generalizes out-of-distribution to novel physical scenarios. Experiments demonstrate a 2.3× improvement in adherence to physical rules over baseline T2V models and a 35% gain over prior prompt-based enhancers. The code is publicly available.
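The refinement loop described above (step-back reasoning to extract physical rules, then CoT-style prompt rewriting, iterated until convergence) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the functions `analyze_physics` and `refine_prompt` are hypothetical stand-ins that a real pipeline would replace with LLM queries, feeding the final prompt to a T2V model.

```python
def analyze_physics(prompt: str) -> list[str]:
    """Step-back reasoning (mocked): infer physical rules the scene implies.
    A real system would ask an LLM to enumerate the governing physics."""
    rules = []
    if "ball" in prompt:
        rules.append("objects fall under gravity and bounce with energy loss")
    if "water" in prompt:
        rules.append("liquids conserve volume and flow downhill")
    return rules

def refine_prompt(prompt: str, rules: list[str]) -> str:
    """CoT-style refinement (mocked): fold only the rules not yet
    reflected in the prompt back into its text."""
    new = [r for r in rules if r not in prompt]
    if not new:
        return prompt
    return prompt + " Physical constraints: " + "; ".join(new) + "."

def enhance(prompt: str, max_rounds: int = 3) -> str:
    """Iterate reason-then-refine until the prompt stops changing.
    No training data or model architecture changes are involved."""
    for _ in range(max_rounds):
        refined = refine_prompt(prompt, analyze_physics(prompt))
        if refined == prompt:  # converged: no new constraints found
            break
        prompt = refined
    return prompt

enhanced = enhance("A ball dropping onto a table")
print(enhanced)
```

The enhanced prompt is what gets passed to the unmodified T2V model, which is why the approach is both data- and model-agnostic.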
📝 Abstract
Text-to-video (T2V) generation has recently been enabled by transformer-based diffusion models, but current T2V models struggle to adhere to real-world common knowledge and physical rules, due to their limited understanding of physical realism and deficient temporal modeling. Existing solutions are either data-driven or require extra model inputs, and cannot generalize to out-of-distribution domains. In this paper, we present PhyT2V, a new data-independent T2V technique that expands current T2V models' video generation capability to out-of-distribution domains, by enabling chain-of-thought and step-back reasoning in T2V prompting. Our experiments show that PhyT2V improves existing T2V models' adherence to real-world physical rules by 2.3x, and achieves a 35% improvement over T2V prompt enhancers. The source code is available at: https://github.com/pittisl/PhyT2V.