🤖 AI Summary
Existing image-to-video generation methods often struggle to produce controllable, realistic object motion because they lack physical grounding and depth awareness. This work proposes PhysLayer, a novel framework that introduces depth-aware layered physical simulation into image animation for the first time. Guided by textual instructions, PhysLayer decomposes a static image into depth layers, extends 2D rigid-body dynamics to support depth-aware motion, and synthesizes videos through trajectory-based simulation coupled with scene-aware relighting. Notably, it achieves perspective-consistent and physically plausible animations without requiring full 3D reconstruction. Experiments show that PhysLayer improves on baseline methods in CLIP-Similarity (+2.2%), FID (+9.3%), and Motion-FID (+3%), while human evaluations show a 24% improvement in physical plausibility and a 35% gain in text-video alignment.
📝 Abstract
Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: first, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters; second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction; and third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
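To make the idea of extending 2D rigid-body dynamics with depth motion and perspective-consistent scaling more concrete, the following is a minimal illustrative sketch and not the PhysLayer implementation. It assumes a pinhole-camera model in which an object's apparent size is inversely proportional to its depth, a point-mass body per layer, semi-implicit Euler integration, and image coordinates with y pointing down. All names (`LayerBody`, `step`, `perspective_scale`, `simulate`) and the gravity constant are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LayerBody:
    """Hypothetical per-layer body: image-plane position plus a depth coordinate."""
    x: float   # horizontal image-plane position
    y: float   # vertical image-plane position (y grows downward)
    z: float   # depth along the camera axis, must stay > 0
    vx: float  # image-plane velocity, horizontal
    vy: float  # image-plane velocity, vertical
    vz: float  # velocity along the depth axis


def step(body: LayerBody, dt: float, gravity: float = 9.8) -> LayerBody:
    """One semi-implicit Euler step: update velocity first, then position."""
    vy = body.vy + gravity * dt  # assumed constant gravity along +y
    return LayerBody(
        x=body.x + body.vx * dt,
        y=body.y + vy * dt,
        z=max(body.z + body.vz * dt, 1e-6),  # keep depth strictly positive
        vx=body.vx,
        vy=vy,
        vz=body.vz,
    )


def perspective_scale(z: float, z_ref: float) -> float:
    """Apparent size relative to the size at depth z_ref (pinhole assumption)."""
    return z_ref / z


def simulate(body: LayerBody, steps: int, dt: float) -> list[tuple[float, float, float]]:
    """Return a trajectory of (x, y, scale) tuples, one per simulated frame."""
    z_ref = body.z  # the initial depth defines scale 1.0
    trajectory = []
    for _ in range(steps):
        body = step(body, dt)
        trajectory.append((body.x, body.y, perspective_scale(body.z, z_ref)))
    return trajectory
```

In this sketch, an object that moves away from the camera keeps following ordinary 2D dynamics in the image plane while its rendered size shrinks in proportion to `z_ref / z`. A fuller treatment would also shift image-plane positions toward the vanishing point and handle collisions between layers, which this example deliberately omits.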