🤖 AI Summary
This work addresses the limitations of existing methods for evaluating instruction-guided audio generation, which rely on holistic scores from general-purpose large language models and therefore struggle to disentangle complex instructions, lack interpretability, and fail to detect fine-grained attribute mismatches. To overcome these challenges, the authors propose a dynamic rubric-based evaluation paradigm that adaptively decomposes intricate audio captions into a variable number of independent, verifiable binary rubric items. They introduce AnyAudio-Judge Bench, a bilingual benchmark of 7,920 samples across four audio domains (speech, sound, music, and mixed), along with a dedicated evaluation model, AnyAudio-Judge, trained on hard negative samples, multi-domain data, and a 105K-sample corpus annotated with chain-of-thought rationales. The model is optimized via supervised fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO), enabling fine-grained and interpretable assessment of audio-instruction alignment. The proposed approach substantially outperforms existing baselines on zero-shot alignment detection and provides precise, interpretable reward signals that improve instruction adherence when used in downstream reinforcement learning for audio generation.
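The rubric idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `RubricItem` type, the example items, and the choice to aggregate binary verdicts as a simple pass fraction are all assumptions made for illustration, since the summary does not state how item verdicts are combined into a score.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One independently verifiable statement about the audio (hypothetical type)."""
    description: str
    satisfied: bool  # binary verdict, e.g. produced by a judge model


def rubric_score(items):
    """Fraction of rubric items judged satisfied (assumed aggregation rule)."""
    if not items:
        raise ValueError("at least one rubric item is required")
    return sum(item.satisfied for item in items) / len(items)


def failed_items(items):
    """Descriptions of the items that were not satisfied, for interpretability."""
    return [item.description for item in items if not item.satisfied]


# Hypothetical decomposition of the caption:
# "A calm female voice speaks Mandarin while rain falls in the background."
items = [
    RubricItem("A female voice is speaking", True),
    RubricItem("The speech is in Mandarin", True),
    RubricItem("Rain is audible in the background", False),
    RubricItem("The speaker sounds calm", True),
]

score = rubric_score(items)
missing = failed_items(items)
```

Unlike a single holistic score, the per-item verdicts show which attribute was missed (here, the background rain), which is the kind of interpretability the rubric paradigm is meant to provide.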
📝 Abstract
The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods rely heavily on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. Through a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge significantly improves zero-shot alignment detection over state-of-the-art baselines and provides precise, interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.
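For readers unfamiliar with GRPO, the sketch below shows its core step as commonly described: each sampled response's reward is normalized against the other responses in its group. This only illustrates the general technique. The abstract does not specify the reward design or hyperparameters used for AnyAudio-Judge, so the example rewards, the `eps` value, and the use of the population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group: (r - mean) / (std + eps).

    `rewards` holds one scalar reward per sampled response for the same prompt.
    The population standard deviation is an assumption; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt, e.g. the
# fraction of rubric items each response judged correctly.
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.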