DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses real-time, synchronized, full-duplex perception and generation of speech and body motion in dyadic interaction. The authors propose a streaming full-duplex speech-motion model built on a dual-tower Transformer: a pretrained full-duplex speech model is kept frozen, and a deeply coupled motion pathway is added alongside it. A unified token interleaving mechanism for the two agents, together with cross-attention guided by a time-aligned speech-motion RoPE, provides fine-grained coordination between speech and motion. The design combines the base model's zero-shot conversational capability with synchronized motion generation. Trained on the 4,000-hour Seamless Interaction dataset, the model reports state-of-the-art performance on both single-agent (monadic) and dyadic interaction benchmarks and captures cross-speaker dependencies.
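To make the idea of dyadic token interleaving concrete, here is a minimal sketch. It is not the paper's implementation: the function name `interleave_dyadic`, the speaker labels, and the strict A-then-B ordering within each frame are all assumptions chosen for illustration. It only shows how two per-frame token streams could be merged into a single time-ordered sequence that a causal model can consume step by step.

```python
from typing import List, Tuple

# Hypothetical item layout: (speaker label, frame index, token id).
Token = Tuple[str, int, int]


def interleave_dyadic(agent_a: List[int], agent_b: List[int]) -> List[Token]:
    """Merge two per-frame token streams into one time-ordered sequence.

    Assumption: both agents emit exactly one token per frame, and within a
    frame agent A's token is placed before agent B's.
    """
    if len(agent_a) != len(agent_b):
        raise ValueError("both streams must cover the same number of frames")

    merged: List[Token] = []
    for frame, (tok_a, tok_b) in enumerate(zip(agent_a, agent_b)):
        merged.append(("A", frame, tok_a))
        merged.append(("B", frame, tok_b))
    return merged


if __name__ == "__main__":
    for item in interleave_dyadic([11, 12], [21, 22]):
        print(item)
```

Because tokens from both speakers share one sequence, any token at frame t can attend to the other speaker's tokens from earlier frames, which is one plausible way to expose cross-speaker dependencies in a streaming setting.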

📝 Abstract

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.
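The time-aligned RoPE mentioned in the abstract can be illustrated with a small sketch. This is an assumption-laden illustration, not the authors' code: the frame rates (10 Hz for speech, 30 Hz for motion), the helper names, and the choice of seconds as the position unit are all hypothetical. The point it demonstrates is general to rotary embeddings: if positions are timestamps rather than token indices, the attention score between a motion query and a speech key depends only on their time offset, even when the two streams run at different rates.

```python
import math
from typing import List

SPEECH_RATE_HZ = 10.0  # hypothetical speech token rate
MOTION_RATE_HZ = 30.0  # hypothetical motion token rate


def token_time(index: int, rate_hz: float) -> float:
    """Convert a token index into a timestamp in seconds."""
    return index / rate_hz


def rope_rotate(vec: List[float], time_s: float, base: float = 10000.0) -> List[float]:
    """Apply a rotary position embedding, using a timestamp as the position."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector dimension must be even")
    out: List[float] = []
    for i in range(0, dim, 2):
        angle = time_s * base ** (-i / dim)  # per-pair rotation frequency
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(q: List[float], q_time: float, k: List[float], k_time: float) -> float:
    """Dot product of a rotated query and a rotated key."""
    rq, rk = rope_rotate(q, q_time), rope_rotate(k, k_time)
    return sum(a * b for a, b in zip(rq, rk))


if __name__ == "__main__":
    q, k = [0.3, -1.2, 0.7, 0.5], [1.0, 0.4, -0.6, 0.9]
    # Motion token 15 and speech token 5 both fall at 0.5 s under the assumed rates.
    t_motion = token_time(15, MOTION_RATE_HZ)
    t_speech = token_time(5, SPEECH_RATE_HZ)
    print(attention_score(q, t_motion, k, t_speech))
```

With index-based positions, motion token 15 and speech token 5 would look ten steps apart; with timestamp-based positions they are treated as simultaneous, which is the alignment property the cross-attention relies on in this sketch.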

Problem

Research questions and friction points this paper is trying to address.

full-duplex

dyadic interaction

speech-motion synchronization

streaming multimodal model

human communication

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex

speech-motion synchronization

dual-tower Transformer