CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the conventional turn-based paradigm in spoken dialogue systems by proposing a speech-to-speech conversational framework that aims for human-like proactivity and naturalness. Methodologically, it introduces a Subjective Action Judgement module that enables real-time interruption handling and system-initiated utterances, and it defines five human-like response strategies: interruption, refusal, deflection (topic shifting), silence, and standard response. ASR, LLM, and TTS components are integrated in a single-file architecture that uses full-duplex WebSocket communication and non-blocking I/O to keep latency low. A dynamic memory mechanism and Action Judgement Supervised Fine-Tuning (SFT) further improve contextual awareness. The key contribution is described as the first open-source, scalable, full-duplex proactive speech interaction framework, with sub-500 ms transition latency, intended to improve dialogue naturalness and system initiative. The implementation is publicly released.

📝 Abstract

CleanS2S is a framework for human-like speech-to-speech interaction that advances conversational AI through a single-file implementation and proactive dialogue capabilities. Our system integrates automatic speech recognition, large language models, and text-to-speech synthesis into a unified pipeline with real-time interruption handling, achieving low transition latency through full-duplex WebSocket connections and non-blocking I/O. Beyond conventional chatbot paradigms, we pioneer a proactive interaction mechanism that combines memory systems with a Subjective Action Judgement module, enabling five human-like response strategies: interruption, refusal, deflection, silence, and standard response. The memory module dynamically aggregates historical and contextual data to inform interaction decisions. This approach breaks the rigid turn-based convention by allowing system-initiated dialogue control and context-aware response selection. We also propose Action Judgement SFT, which assesses input streams to select a response strategy. The framework's single-file implementation with atomic configurations offers researchers unprecedented transparency and extensibility for interaction agents. The code of CleanS2S is released at https://github.com/opendilab/CleanS2S.
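
To make the five response strategies concrete, the following is a minimal illustrative sketch, not the CleanS2S implementation. The names `Action`, `Memory`, and `judge_action` are hypothetical, and the hand-written rules merely stand in for the fine-tuned judgement model the abstract describes; only the five strategy names come from the source.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """The five response strategies named in the abstract."""
    INTERRUPTION = "interruption"
    REFUSAL = "refusal"
    DEFLECTION = "deflection"
    SILENCE = "silence"
    STANDARD = "standard"


@dataclass
class Memory:
    """Hypothetical memory: a rolling window of normalized user turns."""
    max_turns: int = 4
    turns: list = field(default_factory=list)

    def add(self, text: str) -> None:
        self.turns.append(text.lower().strip())
        self.turns = self.turns[-self.max_turns:]


def judge_action(partial_text: str, utterance_done: bool, memory: Memory) -> Action:
    """Rule-based stand-in for a learned judgement model (all rules are invented)."""
    text = partial_text.lower().strip()
    if not text:
        return Action.SILENCE        # nothing worth answering yet
    if "password" in text:
        return Action.REFUSAL        # toy example of a request to decline
    if not utterance_done and len(text.split()) > 30:
        return Action.INTERRUPTION   # user is rambling mid-utterance
    if utterance_done and memory.turns.count(text) >= 2:
        return Action.DEFLECTION     # same question repeated: change topic
    return Action.STANDARD
```

In a streaming setting, a function like this would be called on each partial ASR transcript, so a decision such as interruption can be taken before the user has finished speaking. A real system would replace the rules with a model's prediction.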

Problem

Research questions and friction points this paper is trying to address.

Develops proactive speech-to-speech interaction framework

Integrates real-time interruption handling and low latency

Enables human-like response strategies and dynamic memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-file framework for speech-to-speech interaction

Proactive dialogue with human-like response strategies

Real-time interruption handling via full-duplex websocket
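
The interruption handling listed above can be sketched with Python's standard `asyncio` library. This is a simplified illustration under stated assumptions, not the CleanS2S code: `speak`, `listen`, and `converse` are hypothetical names, and timed sleeps stand in for the send and receive sides of a real full-duplex WebSocket connection.

```python
import asyncio


async def speak(chunks, played):
    """Stand-in for streaming TTS audio to the client, one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0.05)  # simulated time to send one audio chunk
        played.append(chunk)


async def listen(interrupt_after):
    """Stand-in for the receive side: returns when the user starts speaking."""
    await asyncio.sleep(interrupt_after)


async def converse(reply_chunks, interrupt_after):
    """Play a reply, cancelling playback if the user speaks first (barge-in)."""
    played = []
    speaker = asyncio.create_task(speak(reply_chunks, played))
    listener = asyncio.create_task(listen(interrupt_after))

    # Both directions run concurrently; react to whichever finishes first.
    done, pending = await asyncio.wait(
        {speaker, listener}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = speaker not in done

    # Cancel whatever is still running and wait for the cancellation to settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return played, interrupted


if __name__ == "__main__":
    played, interrupted = asyncio.run(converse(["a", "b", "c", "d"], 0.12))
    print(played, interrupted)
```

Because neither task blocks the event loop, incoming speech can cut off outgoing audio between chunks. In a real deployment the two coroutines would read from and write to the same socket rather than sleep.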
