Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
End-to-end speech dialogue systems suffer from factual hallucinations due to insufficient external knowledge, while conventional tool-augmented approaches incur substantial latency, compromising real-time interactivity and fluency. To address this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), the first framework integrating tool invocation directly into an end-to-end speech-in–speech-out architecture: it concurrently predicts and triggers tool queries *during* user speech streaming—enabling “query-while-speaking.” We optimize tool-trigger timing via post-training, jointly modeling acoustic features and retrieval outputs to generate natural spoken responses. Additionally, we introduce AudioCRAG, the first benchmark specifically designed for speech-centric conversational RAG evaluation. Experiments demonstrate a 23.1-percentage-point absolute gain in question-answering accuracy (+200% relative improvement), a 20% reduction in tool invocation latency, and full compatibility with text input.

📝 Abstract
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines, generating more natural, expressive responses with significantly lower latency. However, these systems remain prone to hallucinations due to limited factual grounding. While text-based dialogue systems address this challenge by integrating tools such as web search and knowledge graph APIs, we introduce the first approach to extend tool use directly into speech-in speech-out systems. A key challenge is that tool integration substantially increases response latency, disrupting conversational flow. To mitigate this, we propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech, even before the user finishes speaking. Specifically, we develop a post-training pipeline that teaches the model when to issue tool calls during ongoing speech and how to generate spoken summaries that fuse audio queries with retrieved text results, thereby improving both accuracy and responsiveness. To evaluate our approach, we construct AudioCRAG, a benchmark created by converting queries from the publicly available CRAG dataset into speech form. Experimental results demonstrate that our streaming RAG approach increases QA accuracy by up to 200% relative (from 11.1% to 34.2% absolute) and further enhances user experience by reducing tool use latency by 20%. Importantly, our streaming RAG approach is modality-agnostic and can be applied equally to typed input, paving the way for more agentic, real-time AI assistants.
Problem

Research questions and friction points this paper is trying to address.

Reducing hallucinations in speech-in speech-out dialogue systems
Integrating tool usage with low latency for conversational flow
Improving accuracy and responsiveness through streaming retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts tool queries during ongoing user speech
Generates spoken summaries fusing audio queries with retrieved text results
Reduces tool invocation latency by 20% while improving QA accuracy by up to 200% relative
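The core "query-while-speaking" idea can be sketched as follows: while partial user input streams in, a tool query is predicted and fired *before* the utterance ends, so retrieval overlaps with the remaining speech instead of adding latency afterward. This is a minimal illustrative sketch, not the paper's implementation; the names (`predict_tool_query`, `fake_search`) and the keyword-based trigger are assumptions standing in for the paper's learned trigger policy and real retrieval tools.

```python
# Minimal sketch of streaming tool invocation ("query-while-speaking").
# Assumptions: a simple keyword heuristic replaces the learned trigger;
# fake_search stands in for web search / knowledge graph APIs.
import concurrent.futures
import time
from typing import Optional


def predict_tool_query(partial_transcript: str) -> Optional[str]:
    """Heuristic stand-in for the learned trigger: fire once the
    partial input already looks like a factual question."""
    if "capital of" in partial_transcript:
        return partial_transcript.strip()
    return None


def fake_search(query: str) -> str:
    """Stub retrieval tool; simulated latency overlaps with speech."""
    time.sleep(0.2)
    return f"[retrieved context for: {query}]"


def streaming_rag(chunks: list) -> str:
    """Consume streamed input chunks; launch retrieval as soon as the
    trigger fires, then fuse the retrieved text into the answer."""
    transcript, future = "", None
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        for chunk in chunks:
            transcript += chunk
            if future is None:
                query = predict_tool_query(transcript)
                if query is not None:
                    # Tool call fires mid-stream, before input ends.
                    future = pool.submit(fake_search, query)
        context = future.result() if future else "[no retrieval]"
    return f"Answer grounded in {context}"


chunks = ["what is ", "the capital of ", "france ", "again?"]
print(streaming_rag(chunks))
```

In this toy version the retrieval call launches as soon as the second chunk arrives, so its latency is hidden behind the rest of the user's turn; the paper learns this trigger timing via post-training rather than hand-coding it.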