UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current MLLM evaluation faces two key bottlenecks: high cost of manually constructing image-question-answer triples and bias introduced by automated, single-model evaluators. This paper proposes an unsupervised evaluation framework that relies solely on image data, introducing the novel “peer-review” paradigm. First, diverse visual questions are automatically generated via unsupervised question synthesis. Second, multiple MLLMs collaboratively generate answers and cross-evaluate each other’s responses. Third, a three-dimensional scoring system is established based on answer correctness, visual reasoning capability, and image-text alignment. Evaluated on MMStar and ScienceQA, our method achieves Pearson correlation coefficients of 0.944 and 0.814 with human judgments—significantly outperforming existing automatic evaluators and closely approximating human assessment. The core contribution lies in eliminating reliance on manual annotation and single-model adjudication, enabling scalable, low-bias, and multidimensionally interpretable evaluation of MLLM visual understanding.
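The peer-review loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Model` interface (`generate_question`, `answer`, `judge`) and the uniform averaging are assumptions for clarity; UPME's actual prompting and scoring are more elaborate.

```python
# Hypothetical sketch of the unsupervised peer-review loop.
# Each model is assumed to expose generate_question / answer / judge;
# these names are illustrative, not from the paper.
from itertools import permutations

def peer_review_scores(models, images):
    """For every image, each model synthesizes a question; every model
    answers, and every *other* model reviews the answer with a score
    in [0, 1]. Returns the average peer score per model."""
    totals = {m.name: 0.0 for m in models}
    counts = {m.name: 0 for m in models}
    for image in images:
        for asker in models:
            question = asker.generate_question(image)  # unsupervised synthesis
            for candidate, reviewer in permutations(models, 2):
                answer = candidate.answer(image, question)
                score = reviewer.judge(image, question, answer)
                totals[candidate.name] += score
                counts[candidate.name] += 1
    return {name: totals[name] / counts[name] for name in totals}
```

Because every model both answers and reviews, no single model's biases dominate the final ranking, which is the point of replacing a single MLLM-as-judge with peer review.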

📝 Abstract
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose UPME, an Unsupervised Peer-review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMStar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.
Problem

Research questions and friction points this paper is trying to address.

Reduces human workload in MLLM evaluation
Mitigates biases in automated MLLM assessments
Enhances evaluation scale and scope using image data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised peer review for MLLM evaluation
Automated question generation from image data
Vision-language scoring system reduces bias
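The three-dimensional scoring system above can be reduced to a simple convex combination per answer. The weights below are purely illustrative assumptions, not the paper's values; UPME's actual aggregation may differ.

```python
# Hypothetical aggregation of the three vision-language scoring
# dimensions; the weights are illustrative placeholders.
def vision_language_score(correctness, visual_reasoning, image_text_alignment,
                          weights=(0.5, 0.3, 0.2)):
    """Combine three per-answer scores (each in [0, 1]) into one scalar
    via a convex combination, so the result also lies in [0, 1]."""
    parts = (correctness, visual_reasoning, image_text_alignment)
    assert all(0.0 <= p <= 1.0 for p in parts), "scores must lie in [0, 1]"
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * p for w, p in zip(weights, parts))
```

Scoring each dimension separately is what makes the evaluation multidimensionally interpretable: a reviewer can see whether a model fails on factual correctness or on grounding its answer in the image.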