AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks

📅 2025-06-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current open-source voice agents lack unified support for multi-turn spoken dialogue, tool invocation, and reasoning-based decision-making. To address this, we propose the first open-source, voice-native agent featuring a modular architecture that enables end-to-end spoken input → reasoning → tool invocation → spoken output闭环. Our approach defines tool interfaces via natural language, models actions through abstraction, and orchestrates cascaded open-source ASR, LLM, and TTS models—enhanced by prompt engineering for dynamic reasoning and tool composition. Evaluated on VoiceBench, our agent achieves 92.75% accuracy; it scores 4.39 on AlpacaEval and attains a 90% task success rate in human evaluation—approaching the performance of proprietary systems. This work establishes a scalable, customizable benchmark framework for open-domain voice agents.

Technology Category

Application Category

📝 Abstract
Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA-outperforming all open-weight systems and nearing GPT-4o-and 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.
Problem

Research questions and friction points this paper is trying to address.

Lack of open-source speech-to-speech multi-turn dialogue system
Need for integrated tool use in voice-driven tasks
Absence of agentic reasoning in speech-native assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source speech-native assistant with tool use
Cascaded pipeline of ASR, TTS, and LLMs
Modular design for easy tool integration
🔎 Similar Papers
No similar papers found.