LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

📅 2025-08-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In text-to-video (T2V) generation, controlling multi-object motion trajectories faces two key challenges: (1) existing models and datasets are predominantly designed for single-object scenarios, limiting their capacity to model complex inter-object interactions; and (2) trajectory intersections induce semantic conflicts, severely degrading generation quality. To address these, we propose LayerT2V, the first T2V framework incorporating a hierarchical compositing mechanism. It decomposes video generation into a static background and multiple foreground layers, each corresponding to a distinct object and independently controllable. Motion trajectories are generated layer-by-layer, with explicit modeling of inter-object trajectory interactions. This design eliminates semantic ambiguity in occluded or intersecting regions and enables fine-grained, object-level motion editing. Evaluated on multi-object generation tasks, LayerT2V significantly improves spatiotemporal consistency, outperforming state-of-the-art methods by 1.4× in mean Intersection-over-Union (mIoU) and 4.5× in Average Precision at IoU=0.5 (AP50).

πŸ“ Abstract
Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at https://kr-panghu.github.io/LayerT2V/.
Problem

Research questions and friction points this paper is trying to address.

Control multi-object motion in Text-to-Video generation
Address semantic conflicts in intersecting object trajectories
Enable layered generation for coherent multi-object synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layered generation for multi-object video synthesis
Independent element integration on distinct layers
1.4x mIoU and 4.5x AP50 gains over SOTA in complex multi-object scenarios
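The layered mechanism summarized above can be illustrated with a minimal sketch of per-frame alpha compositing. This is not the paper's actual implementation; `composite_layers` and its array shapes are hypothetical, and the point is only the core idea: each object lives on its own layer, and overlaps are resolved by layer order rather than by mixing semantics in the intersecting region.

```python
import numpy as np

def composite_layers(background, layers):
    """Composite foreground layers over a static background, frame by frame.

    background: (H, W, 3) float RGB array for the static background.
    layers: list of (frames, alphas) pairs, one per object, where
        frames is (T, H, W, 3) RGB and alphas is (T, H, W, 1) in [0, 1].
    Returns a (T, H, W, 3) video. Layers later in the list occlude
    earlier ones wherever their alpha is nonzero, so intersecting
    trajectories are disambiguated purely by layer order.
    """
    num_frames = layers[0][0].shape[0]
    # Start every frame from the static background.
    video = np.broadcast_to(background, (num_frames,) + background.shape).copy()
    # Standard "over" compositing, applied layer by layer.
    for frames, alphas in layers:
        video = alphas * frames + (1.0 - alphas) * video
    return video
```

Because compositing is applied per layer, editing one object's trajectory only changes that layer's `frames`/`alphas` pair; the background and the other objects are untouched.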