🤖 AI Summary
Existing rigid motion transfer methods rely on geometric, generative, or physical priors, forcing a trade-off between generalizability and temporal coherence. This paper proposes a zero-shot framework for rigid motion transfer from a monocular video to a single-view image. Our core innovation is the construction of an *internally shared spatiotemporal transformation prior*: we decouple motion from geometric semantics via 3D spatial mapping, enforce spatiotemporal consistency through learnable positional encoding, model controllable velocity fields, and refine them with position-based dynamical optimization—all without external supervision. This enables cross-object, zero-shot motion transfer. Experiments demonstrate that our method generates high-fidelity, temporally coherent motion videos across diverse object categories, significantly improving visual consistency and inference efficiency compared to prior approaches.
📝 Abstract
We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.
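To make the idea of a geometry-independent prior that "encodes relative spatial variations over time" concrete, here is a minimal sketch of one standard way such relative transformations could be extracted and replayed on a new object: estimating per-frame rigid transforms from a tracked source point cloud (via the Kabsch/Procrustes algorithm) and applying them to a target point cloud. This is an illustrative assumption, not the paper's actual SpaT prior, velocity-field synthesis, or Position-Based Dynamics refinement.

```python
import numpy as np

def rigid_transform(src, dst):
    """Estimate rotation R and translation t minimizing ||src @ R.T + t - dst||
    for corresponding 3D points (Kabsch algorithm; assumed, not the paper's method)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)          # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def transfer_motion(source_frames, target_pts):
    """Replay the source's frame-to-frame rigid motion on a target point cloud,
    independent of the target's geometry or semantics."""
    frames = [target_pts]
    for prev, curr in zip(source_frames[:-1], source_frames[1:]):
        R, t = rigid_transform(prev, curr)       # relative transform between frames
        frames.append(frames[-1] @ R.T + t)      # apply it to the evolving target
    return frames
```

In this toy setting the "prior" is just the sequence of relative `(R, t)` pairs, which is shared between source and target by construction; the actual framework additionally handles non-corresponding geometry and refines the resulting velocity field for visual coherence.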