🤖 AI Summary
Existing full-duplex speech systems rely on modular architectures, leading to error propagation and suboptimal performance in challenging scenarios such as context-dependent barge-in and echo cancellation. Although prior work attempts to inject audio codecs into LLM token spaces, significant quality degradation persists in the speech modality. This paper introduces the first codec-free, monolithic full-duplex speech large language model, enabling real-time, end-to-end switching between listening and speaking directly in the speech modality. Its core contributions are: (1) a novel dynamic thinking mechanism that enables autonomous, fine-grained transitions between listening and speaking states during speech input; and (2) joint optimization of dynamic state modeling, built upon an LLM backbone, with end-to-end speech representation learning and a reinforcement-learning-driven dialogue policy. On spoken QA and open-domain dialogue benchmarks, the method surpasses the open-source state of the art by at least 30% relative, and shows marked superiority in complex scenarios including turn-taking, echo cancellation, and backchanneling.
📝 Abstract
To enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than the text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least a 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively with half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Demo conversations between a user and SALMONN-omni are available at https://github.com/bytedance/SALMONN.
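At a high level, the dynamic thinking mechanism can be pictured as a two-state machine whose transitions are driven by special control decisions the LLM backbone emits at each time step: stay in the current state, take the turn (listen → speak), or yield on a barge-in (speak → listen). The following minimal sketch illustrates only this control loop; the token names (`<shift_speak>`, `<shift_listen>`, `<keep>`) and the function interfaces are illustrative assumptions, not the paper's actual implementation.

```python
from enum import Enum


class State(Enum):
    LISTEN = "listen"
    SPEAK = "speak"


def step(state: State, control_token: str) -> State:
    """Apply one per-frame control decision (hypothetical token names)."""
    if state is State.LISTEN and control_token == "<shift_speak>":
        # Model judges the user's turn has ended: start speaking.
        return State.SPEAK
    if state is State.SPEAK and control_token == "<shift_listen>":
        # Context-dependent barge-in: the user interrupts, so yield the floor.
        return State.LISTEN
    # Otherwise keep the current state (e.g. ignore echo or a backchannel).
    return state


def run_dialogue(token_stream: list[str]) -> list[State]:
    """Fold a stream of control decisions into a state trajectory."""
    state = State.LISTEN
    trajectory = [state]
    for tok in token_stream:
        state = step(state, tok)
        trajectory.append(state)
    return trajectory
```

In the actual model these decisions are learned end-to-end from continuous speech representations rather than rule-based; the sketch only conveys why fine-grained, per-frame state prediction subsumes the voice-activity-detector and interrupter modules of pipeline systems.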