🤖 AI Summary
Existing large audio-language models (LALMs) lack multi-step reasoning and tool-calling capabilities. This paper proposes an audio-language collaborative framework centered on a large language model (LLM) as the orchestrating agent, which dynamically invokes lightweight, multimodal tools—including ASR and audio understanding modules—via modular tool adapters. The framework supports iterative questioning, output verification, and interpretable configuration optimization. Notably, it integrates Monte Carlo sampling with Shapley value analysis to quantify each tool's contribution, enabling zero-shot modular tool composition and performance attribution without fine-tuning. Evaluated on the MMAU, MMAR, and MMAU-Pro benchmarks, the approach achieves state-of-the-art accuracy of 74.10%, 68.80%, and 57.96%, respectively—all without additional training or annotated data.
📝 Abstract
Large Audio-Language Models (LALMs) perform well on audio understanding tasks but lack the multi-step reasoning and tool-calling abilities found in recent Large Language Models (LLMs). This paper presents AudioToolAgent, a framework that coordinates audio-language models as tools via a central LLM agent that accesses tool adapters for audio question answering and speech-to-text. The agent selects tools, asks follow-up questions, and compares outputs for verification. Experiments on MMAU, MMAR, and MMAU-Pro show state-of-the-art accuracy: up to 74.10% on MMAU, 68.80% on MMAR, and 57.96% on MMAU-Pro. Monte Carlo sampling of Shapley values across 374 configurations identifies effective agent-tool combinations. The modular design allows integration of new tools and avoids data collection and training costs. Code and reproduction materials are available at: github.com/GLJS/AudioToolAgent
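The Shapley-value attribution described above can be sketched in a few lines: sample random orderings of the tools and average each tool's marginal contribution to a coalition value function. This is a minimal illustration, not the paper's implementation; the tool names and the `accuracy` function below are hypothetical stand-ins for real benchmark scores.

```python
import random

def mc_shapley(tools, value, n_samples=2000, seed=0):
    """Estimate each tool's Shapley value by Monte Carlo sampling:
    shuffle the tool order, add tools one by one, and credit each
    tool with its marginal gain in the coalition's value."""
    rng = random.Random(seed)
    contrib = {t: 0.0 for t in tools}
    for _ in range(n_samples):
        order = tools[:]
        rng.shuffle(order)
        coalition = set()
        prev = value(coalition)
        for t in order:
            coalition.add(t)
            cur = value(coalition)
            contrib[t] += cur - prev
            prev = cur
    return {t: s / n_samples for t, s in contrib.items()}

# Hypothetical value function: benchmark accuracy of the agent
# with a given subset of tools enabled (numbers are illustrative).
def accuracy(coalition):
    acc = 0.40  # agent baseline with no tools
    if "asr" in coalition:
        acc += 0.10
    if "audio_qa" in coalition:
        acc += 0.30
    if {"asr", "audio_qa"} <= coalition:
        acc += 0.05  # synergy when both tools are available
    return acc

shapley = mc_shapley(["asr", "audio_qa"], accuracy)
```

By the efficiency property, the estimated values sum to the gain of the full toolset over the baseline (0.45 here), so each tool's share reads directly as its accuracy contribution. In the paper's setting, `value` would be the measured benchmark accuracy of an agent configuration, which is why sampling (rather than enumerating all subsets) matters across 374 configurations.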