EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

📅 2026-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing large audio-language models (LALMs) struggle to focus on relevant audio segments in complex audio question answering and lack interpretable, verifiable reasoning processes. This work proposes EChO-Agent, a modular agent framework that builds audio evidence chains and self-verifies its answers, reframing question answering as a pipeline of planning, tool invocation, evidence integration, and answer validation. Where reinforcement learning and tool-augmented prompting alone lack a reliable way to integrate and verify audio evidence, the proposed approach adds an explicit evidence integration stage and improves both accuracy and rubric scores over the baseline on the MMAR benchmark. Ablation studies indicate that evidence integration is the key driver of these gains.

📝 Abstract

While large audio-language models (LALMs) show promise on audio question answering, on complex audio reasoning they fail to focus on question-relevant audio segments and do not provide a clear, checkable reasoning process. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio, but they lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a workflow of planning, tool execution, evidence integration, and answer verification. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.
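The four-stage workflow named in the abstract (planning, tool execution, evidence integration, answer verification) can be pictured with a minimal sketch. This is not the authors' implementation: every name here (`Evidence`, `plan`, `execute`, `integrate`, `verify`, `answer_question`, and the toy tools) is hypothetical, and each stage is a trivial stand-in for what would presumably be a model or audio tool call in the actual system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float]        # (start_s, end_s) in seconds
Tool = Callable[[str, Segment], str]  # hypothetical tool signature


@dataclass
class Evidence:
    """One link in the evidence chain: which tool examined which segment."""
    tool: str
    segment: Segment
    finding: str


def plan(question: str) -> List[Tuple[str, Segment]]:
    """Stage 1, planning: pick tools and segments (hard-coded stand-in)."""
    return [("transcribe", (0.0, 5.0)), ("detect_events", (5.0, 10.0))]


def execute(audio: str, steps: List[Tuple[str, Segment]],
            tools: Dict[str, Tool]) -> List[Evidence]:
    """Stage 2, tool execution: run each planned tool and record its finding."""
    return [Evidence(name, seg, tools[name](audio, seg)) for name, seg in steps]


def integrate(question: str, chain: List[Evidence]) -> str:
    """Stage 3, evidence integration: merge findings (stand-in: join them)."""
    return "; ".join(e.finding for e in chain)


def verify(answer: str, chain: List[Evidence]) -> bool:
    """Stage 4, answer verification (stand-in: every finding is reflected)."""
    return bool(chain) and all(e.finding in answer for e in chain)


def answer_question(question: str, audio: str, tools: Dict[str, Tool]):
    """Run the four stages in order and return answer, chain, and verdict."""
    chain = execute(audio, plan(question), tools)
    answer = integrate(question, chain)
    return answer, chain, verify(answer, chain)


# Toy tools that ignore the audio and return fixed strings (assumption).
toy_tools: Dict[str, Tool] = {
    "transcribe": lambda audio, seg: "speech: 'watch out'",
    "detect_events": lambda audio, seg: "event: glass breaking",
}
answer, chain, ok = answer_question("What happened?", "clip.wav", toy_tools)
```

Returning the evidence chain alongside the answer is what makes the reasoning checkable in this sketch: each finding stays tied to the tool and time segment that produced it.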

Problem

Research questions and friction points this paper is trying to address.

audio reasoning

question answering

evidence integration

reasoning process

audio segments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evidence Chain Orchestration

Audio Reasoning

Modular Agent Framework