Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of reference-free mask quality assessment methods for language-guided audio-visual segmentation. It introduces MQA-RefAVS, the first task to define and implement reference-free evaluation of segmentation masks in this multimodal setting. The proposed approach leverages a multimodal large language model (MLLM) that jointly integrates audio, video, text, and mask inputs, reasoning explicitly to predict IoU scores, identify geometric and semantic error types, and provide actionable suggestions for quality improvement. To support the task, the authors construct MQ-RAVSBench, a benchmark covering diverse mask errors, and propose the MQ-Auditor architecture. Experiments show that MQ-Auditor outperforms existing open-source and commercial MLLMs in assessment accuracy, effectively detects segmentation failures, and improves downstream segmentation performance.

📝 Abstract
Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and a candidate segmentation mask, the task requires estimating the mask's IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and code will be released at https://github.com/jasongief/MQA-RefAVS.
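For context on the quantity MQ-Auditor is trained to estimate: IoU is the standard overlap score between a predicted mask and its ground truth. The paper's setting is reference-free (the ground truth is unobserved at inference time), so the sketch below only illustrates the metric itself, as it would be computed with ground truth during benchmark construction or evaluation. The function name and toy masks are illustrative, not from the paper.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-Union between two binary segmentation masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    inter = np.logical_and(pred, gt).sum()
    return float(inter) / float(union)

# Toy 4x4 example: a 2x2 prediction shifted one column from the ground truth,
# a simple instance of the "geometric" error mode the benchmark covers.
gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1            # 2x2 ground-truth square
pred = np.zeros((4, 4), dtype=np.uint8)
pred[1:3, 2:4] = 1          # 2x2 prediction, shifted right by one column
print(mask_iou(pred, gt))   # intersection 2, union 6 -> 0.333...
```

A reference-free auditor like MQ-Auditor must predict this score from the audio, video, text, and mask alone, without access to `gt`.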
Problem

Research questions and friction points this paper is trying to address.

language-referred audio-visual segmentation
mask quality assessment
reference-free evaluation
segmentation error diagnosis
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

reference-free mask quality assessment
language-referred audio-visual segmentation
multimodal large language model
mask error diagnosis
MQ-RAVSBench