🤖 AI Summary
Existing language-audio models struggle to focus on relevant audio segments in complex audio-based question answering and lack interpretable, verifiable reasoning processes. This work proposes the first modular agent framework that supports the construction of audio evidence chains and self-verification, reframing the question-answering task as a collaborative pipeline involving planning, tool invocation, evidence integration, and answer validation. By integrating reinforcement learning, tool-augmented prompting, and a multi-stage evidence integration mechanism, the proposed approach significantly outperforms current baselines on the MMAR benchmark. Ablation studies confirm that evidence integration is the key driver behind the observed performance gains.
📝 Abstract
While large audio-language models (LALMs) show promise on audio question answering, on complex audio reasoning they fail both to focus on question-relevant audio segments and to provide a clear, checkable reasoning process. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio, but they lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a workflow of planning, tool execution, evidence integration, and answer verification. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.
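
The four-stage workflow named in the abstract (planning, tool execution, evidence integration, answer verification) can be pictured with a minimal Python sketch. This is not EChO-Agent's implementation: every name here (`Evidence`, `plan`, `execute`, `integrate`, `verify`, `answer_question`, `STUB_TOOLS`) is hypothetical, the tools are stubs that return fixed strings instead of analysing audio, and the use of candidate answers with a substring-based verifier is an assumption made only to keep the example runnable.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Evidence:
    """One finding produced by one tool on one audio segment (hypothetical type)."""
    tool: str
    start_s: float
    end_s: float
    finding: str


# Hypothetical tool signature: (audio_path, start_s, end_s) -> textual finding.
Tool = Callable[[str, float, float], str]

# Stub tools standing in for real audio models; they ignore their inputs.
STUB_TOOLS: dict[str, Tool] = {
    "transcribe": lambda path, s, e: "speaker says the train is late",
    "detect_events": lambda path, s, e: "dog barking",
}


def plan(question: str) -> list[tuple[str, float, float]]:
    """Stage 1, planning: choose tools and segments (hard-coded in this sketch)."""
    return [("detect_events", 5.0, 10.0), ("transcribe", 0.0, 5.0)]


def execute(audio_path: str, steps: list[tuple[str, float, float]],
            tools: dict[str, Tool]) -> list[Evidence]:
    """Stage 2, tool execution: run each planned tool on its segment."""
    return [Evidence(name, s, e, tools[name](audio_path, s, e))
            for name, s, e in steps]


def integrate(evidence: list[Evidence]) -> str:
    """Stage 3, evidence integration: merge findings into a time-ordered chain."""
    ordered = sorted(evidence, key=lambda ev: ev.start_s)
    return "\n".join(
        f"[{ev.start_s:.1f}-{ev.end_s:.1f}s] {ev.tool}: {ev.finding}"
        for ev in ordered
    )


def verify(answer: str, chain: str) -> bool:
    """Stage 4, verification: toy check that the chain literally contains the answer."""
    return answer.lower() in chain.lower()


def answer_question(question: str, audio_path: str, tools: dict[str, Tool],
                    candidates: list[str]) -> tuple[str | None, str]:
    """Run all four stages; return the first verified candidate and the chain."""
    chain = integrate(execute(audio_path, plan(question), tools))
    for candidate in candidates:
        if verify(candidate, chain):
            return candidate, chain
    return None, chain  # no candidate is supported by the collected evidence
```

Calling `answer_question("What animal is heard?", "clip.wav", STUB_TOOLS, ["cat meowing", "dog barking"])` returns the supported candidate together with the evidence chain that justifies it, which is the sense in which the reasoning becomes checkable. In a real system each stage would presumably be driven by a model rather than a fixed rule.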