TSTMotion: Training-free Scene-aware Text-to-Motion Generation

📅 2025-05-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing scene-aware text-to-motion generation methods rely on large-scale real motion datasets annotated with scene information, incurring high acquisition costs and exhibiting poor generalization. This paper introduces the first zero-shot, fine-tuning-free scene-aware framework that requires no paired real motion data conditioned on scenes: given only a 3D scene and a textual description, it synthesizes human motion sequences geometrically and semantically consistent with the environment. Our method leverages collaborative reasoning among multimodal foundation models to generate verifiable motion guidance signals; these are integrated via motion-guided injection and latent-space reparameterization to enable zero-shot scene-constrained modeling. Experiments on diverse complex 3D scenes demonstrate substantial improvements in motion plausibility and scene alignment. Moreover, our framework supports plug-and-play integration with mainstream text-to-motion models. Code is publicly available.

📝 Abstract
Text-to-motion generation has recently garnered significant research interest, primarily focusing on generating human motion sequences in blank backgrounds. However, human motions commonly occur within diverse 3D scenes, which has prompted exploration into scene-aware text-to-motion generation methods. Yet, existing scene-aware methods often rely on large-scale ground-truth motion sequences in diverse 3D scenes, which poses practical challenges due to their expensive acquisition cost. To mitigate this challenge, we are the first to propose a Training-free Scene-aware Text-to-Motion framework, dubbed TSTMotion, that efficiently empowers pre-trained blank-background motion generators with scene-aware capability. Specifically, conditioned on the given 3D scene and text description, we adopt foundation models together to reason, predict, and validate a scene-aware motion guidance. Then, the motion guidance is incorporated into the blank-background motion generators with two modifications, resulting in scene-aware text-driven motion sequences. Extensive experiments demonstrate the efficacy and generalizability of our proposed framework. We release our code on the project page: https://tstmotion.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Generating human motions in 3D scenes without training
Overcoming reliance on costly ground-truth motion data
Enhancing blank-background motion models with scene-awareness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free scene-aware text-to-motion framework
Uses foundation models for motion guidance
Modifies pre-trained blank-background motion generators
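The training-free idea above can be illustrated with a minimal sketch: a foundation-model stand-in reasons a goal waypoint from the scene and text, and that guidance is blended into the output of a blank-background generator. All function names, the scene layout, and the blending rule are hypothetical illustrations, not the paper's actual method.

```python
# Hypothetical sketch of a training-free scene-aware pipeline.
# Names and the simple linear blend are illustrative assumptions,
# not the paper's actual motion-guided injection or reparameterization.

def reason_motion_guidance(scene_objects, text):
    """Stand-in for foundation-model reasoning: find the scene object
    mentioned in the text and return its position as a goal waypoint."""
    for name, pos in scene_objects.items():
        if name in text:
            return pos
    return (0.0, 0.0, 0.0)  # fall back to the origin if nothing matches

def inject_guidance(motion, goal, strength=0.5):
    """Blend each frame's root position toward the goal waypoint,
    with guidance weight growing linearly over the sequence."""
    guided = []
    n = len(motion)
    for i, frame in enumerate(motion):
        w = strength * (i + 1) / n  # stronger pull near the end
        guided.append(tuple(f * (1 - w) + g * w for f, g in zip(frame, goal)))
    return guided

scene = {"sofa": (2.0, 0.0, 1.0), "table": (-1.0, 0.0, 3.0)}
text = "a person walks to the sofa and sits down"
goal = reason_motion_guidance(scene, text)

# Pretend output of a pre-trained blank-background generator
# (root trajectory only, one (x, y, z) tuple per frame).
blank_motion = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.2), (1.0, 0.0, 0.4)]
scene_motion = inject_guidance(blank_motion, goal, strength=1.0)
print(scene_motion[-1])  # final frame lands on the goal object
```

Because the generator itself is untouched, the same guidance step could in principle be plugged into any pre-trained text-to-motion model, which mirrors the plug-and-play claim in the summary.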
Ziyan Guo
Singapore University of Technology and Design, Singapore
Haoxuan Qu
Lancaster University
Computer Vision
Hossein Rahmani
Professor, Lancaster University
Computer Vision · Machine Learning · Video Analysis · Action Recognition · Human Behavior Analysis
D. Soh
Singapore University of Technology and Design, Singapore
Ping Hu
UESTC
Computer Vision · Deep Learning · Image/Video Processing
Qi Ke
Monash University, Victoria, Australia
Jun Liu
Lancaster University, Lancaster, England