MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tendency of existing text-to-motion generation models to overlook non-dominant actions in composite motion descriptions, which often results in semantically incomplete outputs. To mitigate this without retraining, the authors propose a lightweight inference-time framework that operates on top of pretrained models. A prompt-aware decision module selects which text tokens and network layers to emphasize, and the cross-attention scores of those tokens are then amplified in those layers. According to the authors' evaluations, this alleviates semantic collapse and consistently outperforms existing baselines on composite prompts, improving semantic coverage while preserving motion realism.

📝 Abstract

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.
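The attention-amplification idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `boosted_cross_attention`, the `target_tokens` and `scale` parameters, and the choice to rescale post-softmax weights and renormalize are all assumptions made for illustration. The abstract does not specify where in the attention computation the amplification happens, and the decision scheme that picks tokens and layers is not modeled here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def boosted_cross_attention(q, k, v, target_tokens, scale=1.5):
    """Single-head cross-attention with extra weight on chosen text tokens.

    q: (frames, d) motion queries; k, v: (tokens, d) text keys and values.
    target_tokens: indices of the tokens to emphasize (hypothetical input;
    in the paper this choice comes from a decision scheme not shown here).
    scale: multiplicative gain for those tokens (assumed, scale=1 is a no-op).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)           # (frames, tokens)
    weights = softmax(logits, axis=-1)

    # Amplify the selected tokens, then renormalize so each row sums to 1.
    gain = np.ones(k.shape[0])
    gain[list(target_tokens)] = scale
    weights = weights * gain
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ v, weights


# Usage: emphasize token 3 of a 5-token prompt for 4 motion frames.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))
_, base_w = boosted_cross_attention(q, k, v, target_tokens=[], scale=1.0)
_, boost_w = boosted_cross_attention(q, k, v, target_tokens=[3], scale=2.0)
```

Multiplying post-softmax weights by a gain and renormalizing is equivalent to adding `log(scale)` to the corresponding pre-softmax logits, so either formulation gives the same result in this sketch. In a real generator such a hook would be applied only in the selected cross-attention layers.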

Problem

Research questions and friction points this paper is trying to address.

text-to-motion generation

composite prompts

multi-action synthesis

semantic collapse

attention guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-to-motion

composite prompts

attention guidance