MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the susceptibility of multimodal large language models (MLLMs) to compositional bias when employed as automatic evaluators, leading to unstable judgments under missing, mismatched, or perturbed visual and textual cues. The study presents the first systematic definition and quantification of compositional bias in MLLM-as-a-Judge settings, introducing a fine-grained diagnostic framework encompassing nine bias categories. It constructs a high-quality evaluation benchmark featuring controlled perturbations across queries, images, and responses, integrating over 1,800 samples from 29 source datasets. To assess model sensitivity and stability, the authors propose two complementary metrics: Bias-Deviation and Bias-Conformity. Extensive experiments on 26 state-of-the-art MLLMs reveal pervasive tendencies toward modality neglect and asymmetric evaluation behavior, demonstrating the benchmark’s effectiveness in diagnosing model reliability.

📝 Abstract

Multimodal Large Language Models (MLLMs) are increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to bias remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. MM-JudgeBias applies controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.
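
The abstract does not give the formulas for BD and BC, so the sketch below does not reproduce them. It only illustrates the general idea of comparing a judge's scores on original and perturbed inputs, using two hypothetical stand-in measures: a mean absolute score shift (a rough notion of sensitivity) and the fraction of pairs whose score stays within a tolerance (a rough notion of stability). The function names, the score scale, and the tolerance are all assumptions made for illustration.

```python
from statistics import mean


def mean_abs_score_shift(original, perturbed):
    """Average absolute change in judge score across paired samples.

    Hypothetical stand-in for a sensitivity measure; NOT the paper's BD.
    """
    if len(original) != len(perturbed):
        raise ValueError("score lists must be paired one-to-one")
    return mean(abs(o - p) for o, p in zip(original, perturbed))


def within_tolerance_rate(original, perturbed, tol=1.0):
    """Fraction of pairs whose judge score moved by at most `tol`.

    Hypothetical stand-in for a stability measure; NOT the paper's BC.
    """
    if len(original) != len(perturbed):
        raise ValueError("score lists must be paired one-to-one")
    return mean(
        1.0 if abs(o - p) <= tol else 0.0
        for o, p in zip(original, perturbed)
    )


# Made-up judge scores on an assumed 1-10 scale, for illustration only.
scores_original = [8, 7, 9, 6]
scores_image_removed = [8, 7, 5, 6]   # evidence removed: a shift is expected
scores_paraphrased = [8, 6, 9, 6]     # irrelevant change: scores should hold

shift = mean_abs_score_shift(scores_original, scores_image_removed)
stable = within_tolerance_rate(scores_original, scores_paraphrased, tol=1.0)
print(f"shift under image removal: {shift}")
print(f"stable under paraphrase:   {stable}")
```

Under this toy framing, a judge that barely changes its score when the image is removed would be ignoring visual evidence, while a judge whose scores swing under a harmless paraphrase would be unstable.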

Problem

Research questions and friction points this paper is trying to address.

Compositional Bias

MLLM-as-a-Judge

Multimodal Large Language Models

Evaluation Reliability

Bias in AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compositional Bias

MLLM-as-a-Judge

MM-JudgeBias