🤖 AI Summary
This work addresses stylized human motion generation driven by multimodal inputs (text, image, video, audio, and motion), aiming to jointly optimize content diversity and style fidelity. We propose the first unified framework supporting joint conditioning on all five modalities. Its core innovation is a style-content cross-modal fusion mechanism: building on a latent diffusion model (LDM), we integrate pretrained multimodal encoders (e.g., CLIP, AudioMAE), a dedicated style encoder, and a cross-modal feature disentanglement and fusion module to extract stylistic cues precisely and inject them independently of content. The method enables zero-shot style transfer and joint conditioning on mixed multimodal inputs. Evaluated on multiple benchmarks, it surpasses state-of-the-art methods, with clear gains in cross-modal style generalization and fine-grained motion control accuracy.
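To make the fusion idea concrete, below is a minimal PyTorch sketch of one way a style-content cross fusion block could inject a style embedding into motion latents inside an LDM denoiser. The module name, dimensions, and the choice of cross-attention plus AdaIN-style modulation are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: the actual StyleMotif fusion design is not
# specified here, so the module name, dimensions, and cross-attention +
# AdaIN-style modulation are assumptions.
import torch
import torch.nn as nn


class StyleContentCrossFusion(nn.Module):
    """Injects a style embedding into content latents inside a denoiser block."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: content latents attend to the style embedding.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # AdaIN-style scale/shift predicted from the style embedding.
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: (B, T, dim) motion latents; style: (B, dim) style embedding.
        style_tokens = style.unsqueeze(1)                                   # (B, 1, dim)
        attended, _ = self.cross_attn(self.norm(content), style_tokens, style_tokens)
        fused = content + attended                                          # residual fusion
        scale, shift = self.to_scale_shift(style).chunk(2, dim=-1)
        return fused * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Usage: fuse a style embedding (e.g., derived from CLIP/AudioMAE features)
# with motion latents during denoising.
fusion = StyleContentCrossFusion()
latents = torch.randn(4, 196, 512)   # (batch, frames, latent dim)
style_emb = torch.randn(4, 512)      # output of the style encoder
out = fusion(latents, style_emb)     # (4, 196, 512)
```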
📝 Abstract
We present StyleMotif, a novel Stylized Motion Latent Diffusion model that generates motion conditioned on both content and style from multiple modalities. Unlike existing approaches that focus either on generating diverse motion content or on transferring style from reference sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: https://stylemotif.github.io
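As an illustration of how a style encoder might be aligned with a frozen pre-trained multi-modal model, here is a hedged sketch of a symmetric contrastive alignment loss; the function name, embedding sources, and loss form are assumptions and may differ from the objective actually used in StyleMotif.

```python
# Illustrative sketch: aligning motion style embeddings to a frozen
# multi-modal embedding space (e.g., CLIP). The symmetric InfoNCE form
# is an assumption, not the paper's confirmed training objective.
import torch
import torch.nn.functional as F


def style_alignment_loss(motion_style_emb: torch.Tensor,
                         multimodal_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Pulls paired (motion style, reference modality) embeddings together."""
    motion = F.normalize(motion_style_emb, dim=-1)   # (B, D) from the style encoder
    ref = F.normalize(multimodal_emb, dim=-1)        # (B, D) from the frozen encoder
    logits = motion @ ref.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```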