Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

📅 2026-06-09

🤖 AI Summary

This work addresses unnatural interaction behaviors, such as excessive silence and ill-timed turn-taking, that arise in full-duplex spoken dialogue models because token-level likelihood maximization does not directly optimize interaction-level behavior. The authors propose a reinforcement learning–based post-training alignment method covering four axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, short audio segments extracted from human conversation corpora are scored with an axis-specific reward function, and an additional LLM-based reward for response quality prevents semantic degradation. Applied to two open-source models, Moshi and PersonaPlex, the method yields consistent interactivity improvements in both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

📝 Abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.
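The abstract does not give the form of the rewards, so the following is only a minimal sketch of how an axis-specific reward and an LLM-based quality reward might be combined for one audio segment. The additive combination, the `quality_weight` parameter, the segment fields, and both toy scoring functions are assumptions made for illustration, not the authors' formulation.

```python
from typing import Callable, Dict

# The four interactivity axes named in the abstract.
AXES = ("pause_handling", "turn_taking", "backchanneling", "user_interruption")

Segment = Dict[str, float]            # hypothetical per-segment features
RewardFn = Callable[[Segment], float]


def combined_reward(
    axis: str,
    segment: Segment,
    axis_rewards: Dict[str, RewardFn],
    quality_reward: RewardFn,
    quality_weight: float = 0.5,      # assumed weighting, not from the paper
) -> float:
    """Score one segment: axis-specific reward plus weighted quality reward."""
    if axis not in AXES:
        raise ValueError(f"unknown interactivity axis: {axis!r}")
    return axis_rewards[axis](segment) + quality_weight * quality_reward(segment)


# Toy placeholder (assumption): during a user pause, reward the model for
# staying silent, measured by the fraction of the segment it spends speaking.
def toy_pause_reward(segment: Segment) -> float:
    return 1.0 - segment["model_speech_fraction"]


# Toy placeholder (assumption): stands in for an LLM judge's quality score.
def toy_quality_reward(segment: Segment) -> float:
    return segment["quality"]


toy_axis_rewards = {"pause_handling": toy_pause_reward}
example = {"model_speech_fraction": 0.25, "quality": 1.0}
score = combined_reward("pause_handling", example, toy_axis_rewards, toy_quality_reward)
```

Keeping the quality term separate from the axis term mirrors the abstract's description of the LLM-based reward as an extra signal that guards against semantic degradation while the axis rewards target timing behavior.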

Problem

Research questions and friction points this paper is trying to address.

full-duplex speech

interactivity

turn-taking

reinforcement learning

spoken dialogue systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

full-duplex speech

interactivity alignment

reinforcement learning