AudioToolAgent: An Agentic Framework for Audio-Language Models

📅 2025-10-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large audio-language models (LALMs) lack multi-step reasoning and tool-calling capabilities. This paper proposes AudioToolAgent, a framework in which a large language model (LLM) acts as the orchestrating agent and dynamically invokes lightweight, multimodal tools, including ASR and audio understanding modules, through modular tool adapters. The agent can ask follow-up questions, verify tool outputs against one another, and be configured in an interpretable way. Monte Carlo sampling of Shapley values quantifies each tool's contribution, which enables zero-shot composition of tools and performance attribution without fine-tuning. On the MMAU, MMAR, and MMAU-Pro benchmarks, the framework reports state-of-the-art accuracy of up to 74.10%, 68.80%, and 57.96%, respectively, without additional training or annotated data.

📝 Abstract
Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool calling found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent, which accesses tool adapters for audio question answering and speech-to-text. The agent selects tools, asks follow-up questions, and compares outputs for verification. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 74.10% on MMAU, 68.80% on MMAR, and 57.96% on MMAU-Pro. Monte Carlo sampling of Shapley values across 374 configurations identifies effective agent-tool combinations. The modular design allows new tools to be integrated and removes the need for training data and the associated training cost. Code and reproduction materials are available at: github.com/GLJS/AudioToolAgent
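The Shapley-value analysis mentioned in the abstract can be illustrated with a generic permutation-sampling estimator. This is a minimal sketch, not the authors' implementation: the function `monte_carlo_shapley`, the tool names, and the additive `toy_accuracy` function are hypothetical, and the toy numbers are invented for illustration rather than taken from the paper. In practice the value function would be the measured benchmark accuracy of an agent restricted to a given set of tools.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def monte_carlo_shapley(
    tools: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
    num_permutations: int = 2000,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {tool: 0.0 for tool in tools}
    for _ in range(num_permutations):
        order = list(tools)
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        previous = value(coalition)
        for tool in order:
            # Marginal contribution of `tool` when added to the tools before it.
            coalition = coalition | {tool}
            current = value(coalition)
            totals[tool] += current - previous
            previous = current
    return {tool: total / num_permutations for tool, total in totals.items()}


def toy_accuracy(coalition: FrozenSet[str]) -> float:
    """Hypothetical additive accuracy model; the numbers are made up."""
    gains = {"asr": 0.10, "audio_qa": 0.20, "captioner": 0.05}
    return 0.40 + sum(gains[tool] for tool in coalition)


estimates = monte_carlo_shapley(["asr", "audio_qa", "captioner"], toy_accuracy)
for tool, contribution in estimates.items():
    print(f"{tool}: {contribution:.3f}")
```

Because the toy value function is additive, each estimate equals that tool's individual gain, which makes the sketch easy to sanity-check. With a real, non-additive accuracy function, the estimates would depend on the number of sampled permutations.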
Problem

Research questions and friction points this paper is trying to address.

Addressing multi-step reasoning gaps in audio-language models
Enhancing tool-calling capabilities for audio question answering
Eliminating training costs through a modular tool-integration framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a central LLM agent to coordinate audio tools
Selects tools and compares outputs for verification
Modular design integrates new tools without training
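
The coordination pattern listed above (tool adapters, follow-up questions, output comparison) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the AudioToolAgent code: `ToolAdapter`, `answer_with_verification`, and the three stub tools are hypothetical names, the stubs return canned strings instead of calling real models, and a simple majority vote stands in for the LLM agent's judgment.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolAdapter:
    """Uniform wrapper so any audio tool can be plugged in without training."""
    name: str
    run: Callable[[str, str], str]  # (audio_path, question) -> answer text


def answer_with_verification(
    audio_path: str,
    question: str,
    tools: List[ToolAdapter],
    max_followups: int = 1,
) -> Dict[str, object]:
    """Query every tool, compare answers, and ask a follow-up on disagreement."""
    prompt = question
    for round_idx in range(max_followups + 1):
        answers = {tool.name: tool.run(audio_path, prompt) for tool in tools}
        top, votes = Counter(answers.values()).most_common(1)[0]
        # Stop on a strict majority, or when the follow-up budget is spent.
        if votes > len(tools) / 2 or round_idx == max_followups:
            return {"answer": top, "votes": votes,
                    "followups": round_idx, "answers": answers}
        # Follow-up question: show the competing answers to every tool.
        candidates = ", ".join(sorted(set(answers.values())))
        prompt = f"{question} Candidates: {candidates}. Which one is correct?"
    raise AssertionError("unreachable: the loop always returns")


# Stub tools with canned behavior (assumptions for the demo only).
def stub_asr(audio_path: str, question: str) -> str:
    return "speech"


def stub_audio_qa(audio_path: str, question: str) -> str:
    return "speech" if "Candidates:" in question else "music"


def stub_captioner(audio_path: str, question: str) -> str:
    return "noise"


tools = [
    ToolAdapter("asr", stub_asr),
    ToolAdapter("audio_qa", stub_audio_qa),
    ToolAdapter("captioner", stub_captioner),
]
result = answer_with_verification("clip.wav", "What is the dominant sound?", tools)
print(result["answer"], result["followups"])
```

Adding a tool in this sketch means appending another `ToolAdapter` to the list, which mirrors the training-free modularity described above. A real agent would also choose which tools to call rather than querying all of them.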