Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss

📅 2025-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address semantic ambiguity and motion detail distortion in text-to-human motion generation caused by the lack of frequency-domain modeling, this paper proposes a two-stage diffusion framework: (1) a semantic planning stage that enhances robustness of static structural priors, and (2) a fine-grained refinement stage dedicated to preserving high-frequency motion dynamics. We introduce, for the first time, a frequency-enhancement mechanism and stage-specific consistency losses to decouple semantic structure modeling from dynamic motion detail modeling, thereby overcoming inherent limitations of purely time-domain approaches. The method integrates frequency-domain feature extraction, dual-stage noise scheduling, and semantic-motion consistency constraints, built upon the StableMoFusion pretrained architecture for efficient optimization. On the StableMoFusion benchmark, our approach achieves a new state-of-the-art FID of 0.051—substantially lower than the prior best of 0.189—demonstrating significant improvements in both semantic fidelity and motion realism.
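The summary's core idea — splitting a motion sequence into a low-frequency "semantic structure" component and a high-frequency "fine dynamics" component, then applying a stage-specific consistency loss — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `cutoff_ratio` hyperparameter, and the `t_split` timestep boundary are all assumptions for exposition.

```python
import numpy as np

def split_frequency_bands(motion, cutoff_ratio=0.25):
    """Decompose a motion sequence of shape (T, D) into low- and
    high-frequency parts along the time axis via the real FFT.
    cutoff_ratio (hypothetical hyperparameter) is the fraction of
    rFFT bins kept as "low frequency" (coarse semantic structure)."""
    T = motion.shape[0]
    spec = np.fft.rfft(motion, axis=0)           # (T//2+1, D) complex spectrum
    cutoff = max(1, int(spec.shape[0] * cutoff_ratio))
    low_spec = np.zeros_like(spec)
    low_spec[:cutoff] = spec[:cutoff]            # keep only the low bins
    low = np.fft.irfft(low_spec, n=T, axis=0)    # coarse, semantic component
    high = motion - low                          # residual fine dynamics
    return low, high

def stage_consistency_loss(pred, target, t, t_split, cutoff_ratio=0.25):
    """Stage-specific loss sketch: at high noise levels (semantic planning
    stage, t >= t_split) penalize low-frequency mismatch; at low noise
    levels (fine-grained refinement) penalize high-frequency mismatch.
    t_split dividing the two stages is an assumed scheduling choice."""
    pred_low, pred_high = split_frequency_bands(pred, cutoff_ratio)
    tgt_low, tgt_high = split_frequency_bands(target, cutoff_ratio)
    if t >= t_split:
        return float(np.mean((pred_low - tgt_low) ** 2))
    return float(np.mean((pred_high - tgt_high) ** 2))
```

By construction `low + high` reconstructs the input exactly, so the decomposition loses no information; only the loss weighting changes across denoising stages.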

📝 Abstract
Rapid progress in text-to-motion generation has been largely driven by diffusion models. However, existing methods focus solely on temporal modeling, thereby overlooking frequency-domain analysis. We identify two key phases in motion denoising: the **semantic planning stage** and the **fine-grained improving stage**. To address these phases effectively, we propose **Fre**quency **e**nhanced **t**ext-**to**-**m**otion diffusion model (**Free-T2M**), incorporating stage-specific consistency losses that enhance the robustness of static features and improve fine-grained accuracy. Extensive experiments demonstrate the effectiveness of our method. Specifically, on StableMoFusion, our method reduces the FID from **0.189** to **0.051**, establishing a new SOTA performance within the diffusion architecture. These findings highlight the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.
Problem

Research questions and friction points this paper is trying to address.

MotionSynthesis
FrequencyAnalysis
StabilityRetention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Free-T2M
Frequency-enhanced Text-to-Motion
StableMoFusion Superiority