AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks

📅 2025-06-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current open-source voice agents lack unified support for multi-turn spoken dialogue, tool invocation, and reasoning-based decision-making. To address this, the authors propose AURA, the first open-source, voice-native agent, whose modular architecture closes the loop from spoken input through reasoning and tool invocation to spoken output. The approach defines tool interfaces in natural language, models actions as abstract action classes, and orchestrates a cascade of open-source ASR, LLM, and TTS models, with prompt engineering used for dynamic reasoning and tool composition. On VoiceBench, the agent reaches 92.75% accuracy on OpenBookQA and scores 4.39 on AlpacaEval; in human evaluation it attains a 90% task success rate, approaching the performance of proprietary systems. The work provides a scalable, customizable framework for open-domain voice agents.

📝 Abstract

Despite advances in language and speech technologies, no open-source system enables full speech-to-speech, multi-turn dialogue with integrated tool use and agentic reasoning. We introduce AURA (Agent for Understanding, Reasoning, and Automated Tool Use), the first open-source, speech-native assistant capable of completing complex, goal-driven tasks through dynamic tool invocation and multi-turn conversation. AURA combines open-weight ASR, TTS, and LLMs in a cascaded pipeline and supports tools such as calendar booking, contact lookup, web search, and email. Its modular design allows easy integration of new tools using natural language prompts and action classes. On VoiceBench, AURA scores 92.75% on OpenBookQA (outperforming all open-weight systems and nearing GPT-4o) and 4.39 on AlpacaEval, competitive with other open-weight systems. Human evaluation shows 90% task success on complex, multi-turn speech tasks.

Problem

Research questions and friction points this paper is trying to address.

Lack of an open-source speech-to-speech, multi-turn dialogue system

Need for integrated tool use in voice-driven tasks

Absence of agentic reasoning in speech-native assistants

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source speech-native assistant with tool use

Cascaded pipeline of ASR, TTS, and LLMs

Modular design for easy tool integration
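
The cascaded design and the natural-language tool registration described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not AURA's actual code: the names `Tool`, `ToolRegistry`, `fake_asr`, `fake_llm`, `fake_tts`, and `handle_turn` are hypothetical, and the three model stages are replaced with trivial stubs so the example runs without any speech or language model installed.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    """Hypothetical action class: a name, a natural-language description, a handler."""
    name: str
    description: str
    run: Callable[[str], str]


class ToolRegistry:
    """Holds tools and renders their descriptions for inclusion in an LLM prompt."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def prompt_block(self) -> str:
        # One line per tool, in plain language, as a prompt fragment.
        return "\n".join(f"- {t.name}: {t.description}" for t in self._tools.values())

    def call(self, name: str, argument: str) -> str:
        return self._tools[name].run(argument)


def fake_asr(audio: bytes) -> str:
    # Stand-in for an ASR model: the "audio" here is just UTF-8 text.
    return audio.decode("utf-8")


def fake_llm(transcript: str, tools_prompt: str) -> str:
    # Stand-in for the LLM decision step. A real model would read tools_prompt
    # and reason over the dialogue; this stub applies a simple keyword rule.
    if "contact" in transcript.lower():
        name = transcript.rstrip("?.! ").split()[-1]
        return f"CALL contact_lookup {name}"
    return f"SAY I heard: {transcript}"


def fake_tts(text: str) -> bytes:
    # Stand-in for a TTS model: returns the reply text as bytes.
    return text.encode("utf-8")


def handle_turn(audio: bytes, registry: ToolRegistry) -> bytes:
    """One turn of the cascade: speech in, decision, optional tool call, speech out."""
    transcript = fake_asr(audio)
    decision = fake_llm(transcript, registry.prompt_block())
    if decision.startswith("CALL "):
        _, tool_name, argument = decision.split(" ", 2)
        reply = registry.call(tool_name, argument)
    else:
        reply = decision[len("SAY "):]
    return fake_tts(reply)


# Registering a new tool only requires a description and a handler.
registry = ToolRegistry()
registry.register(Tool(
    name="contact_lookup",
    description="Look up a contact's email address by first name.",
    run=lambda name: {"Ada": "ada@example.com"}.get(name, "No contact found."),
))
```

In this sketch, adding a capability such as calendar booking would mean registering one more `Tool`; the pipeline function itself stays unchanged, which is the property the modular design is meant to provide.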
