EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

📅 2025-04-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech-driven full-body motion generation methods lack semantic awareness when selecting keyframes to mask, leading to motion reconstruction artifacts. To address this, the authors propose a speech-queried attention (SQA) mechanism that jointly models semantic and rhythmic cues via learnable speech queries in a motion-audio aligned latent space, enabling dynamic, frame-level masking. A motion-audio alignment module (MAM) and cross-modal feature projection provide end-to-end fusion of speech-guided signals. The approach improves motion naturalness, rhythm synchronization, and semantic fidelity: quantitative evaluations on multiple benchmarks show consistent gains over state-of-the-art methods, and qualitative analysis confirms perceptual improvements in motion coherence and speech-movement alignment.

📝 Abstract
Masked modeling frameworks have shown promise in co-speech motion generation. However, they struggle to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.
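The masking step described in the abstract, scoring motion frames against speech queries and masking the highest-scoring ones, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, tensor shapes, scaled dot-product scoring, and the diagonal-score heuristic for per-frame relevance are all assumptions made for clarity.

```python
import torch

def speech_queried_masking(motion_feats, speech_queries, mask_ratio=0.4):
    """Illustrative sketch of speech-queried attention (SQA) masking.

    motion_feats:   (T, D) per-frame motion features, used as attention keys.
    speech_queries: (T, D) motion-aligned speech features, used as queries.
    Returns a boolean mask of shape (T,), True for frames selected to mask.
    """
    T, D = motion_feats.shape
    # Scaled dot-product attention between speech queries and motion keys.
    scores = speech_queries @ motion_feats.T / D ** 0.5   # (T, T)
    # Assumed heuristic: the diagonal entry is the score of each frame's
    # own speech query against its motion key, i.e. frame-level relevance.
    frame_scores = scores.diagonal()                      # (T,)
    # Mask the frames with the highest speech-motion attention scores
    # (the rhythm-related / semantically expressive frames, per the paper).
    k = max(1, int(mask_ratio * T))
    top = torch.topk(frame_scores, k).indices
    mask = torch.zeros(T, dtype=torch.bool)
    mask[top] = True
    return mask
```

In a full pipeline, the masked frames would be reconstructed by the generation network, which the paper additionally conditions on the motion-aligned speech features.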
Problem

Research questions and friction points this paper is trying to address.

Identifying semantically significant frames for motion masking
Generating co-speech motion guided by speech features
Improving motion generation quality through selective masking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-queried attention-based mask modeling
Motion-audio alignment module for joint space
Selective masking guided by attention scores
Xiangyue Zhang — Wuhan University; Tongyi Lab, Alibaba Group
Jianfang Li — Tongyi Lab, Alibaba Group
Jiaxu Zhang — Wuhan University (computer vision, generative AI, 2D/3D character animation, MLLM)
Jianqiang Ren — Tongyi Lab, Alibaba Group
Liefeng Bo — Head of Applied Computer Vision Lab at Alibaba Group (Machine Learning, Computer Vision, Robotics)
Zhigang Tu — Wuhan University