AI Summary
This work addresses the susceptibility of multimodal large language models (MLLMs) to visual-textual conflicts in automated evaluation: as judges, they often reward responses that are semantically plausible yet perceptually inaccurate, which makes their assessments unreliable. The study presents what it describes as the first systematic formulation of, and solution to, perceptual judgment bias in multimodal critique. It proposes a unified training framework that operates without explicit pairwise labels. The framework builds a perceptually perturbed judgment dataset from controlled visual perturbations and counterfactual response construction, then trains the judge with structured GRPO reinforcement learning and batch-wise ranking optimization. The result is a critique model whose judgments are perceptually consistent and verifiable. Experiments show significant improvements across multiple benchmarks in perceptual fidelity, ranking consistency, and correlation with human judgments, supporting the method's scalability and robustness.
Abstract
Recent multimodal large language models (MLLMs) have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Using controlled visual perturbations, we show that existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.
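To make the idea of verifiable supervision combined with GRPO concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary reward that checks whether a sampled judge verdict matches the label implied by the perturbation, and it uses the standard GRPO group-relative advantage (rewards standardized within a group of sampled outputs). The function names, the verdict encoding ("A"/"B"), and the 0/1 reward values are all hypothetical; the abstract does not specify the structured reward or the batch-ranking objective, so neither is modeled here.

```python
from statistics import mean, pstdev


def verifiable_reward(judge_verdict: str, correct_verdict: str) -> float:
    """Hypothetical binary reward: 1.0 if the judge's verdict matches the
    label known from how the counterfactual response was constructed."""
    return 1.0 if judge_verdict == correct_verdict else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward against the mean and
    (population) standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: four judge verdicts sampled for one (image, response pair)
# item, where response "A" is the perceptually correct one.
sampled_verdicts = ["A", "B", "A", "A"]
rewards = [verifiable_reward(v, "A") for v in sampled_verdicts]
advantages = group_relative_advantages(rewards)
```

In this sketch the one verdict that sides with the perceptually incorrect response receives a negative advantage, while the three correct verdicts receive positive ones, so no pairwise preference labels are needed beyond the label implied by the perturbation itself.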