AI Summary
This work addresses the susceptibility of multimodal large language models (MLLMs) to compositional bias when employed as automatic evaluators, leading to unstable judgments under missing, mismatched, or perturbed visual and textual cues. The study presents the first systematic definition and quantification of compositional bias in MLLM-as-a-Judge settings, introducing a fine-grained diagnostic framework encompassing nine bias categories. It constructs a high-quality evaluation benchmark featuring controlled perturbations across queries, images, and responses, integrating over 1,800 samples from 29 source datasets. To assess model sensitivity and stability, the authors propose two complementary metrics: Bias-Deviation and Bias-Conformity. Extensive experiments on 26 state-of-the-art MLLMs reveal pervasive tendencies toward modality neglect and asymmetric evaluation behavior, demonstrating the benchmark's effectiveness in diagnosing model reliability.
Abstract
Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.