🤖 AI Summary
Existing LLM-based peer review research lacks a unified, multimodal evaluation benchmark, which hinders rigorous assessment of models' ability to generate comprehensive, accurate, and human-aligned review comments, particularly when papers contain figures and tables. To address this, we introduce MMReview, the first interdisciplinary, multimodal peer review benchmark. It comprises 240 papers across four disciplines and 17 subfields, each accompanied by its figures and tables and by expert-written reviews. MMReview defines 13 evaluation tasks spanning review generation, decision prediction, human preference alignment, and robustness against adversarial perturbations. We conduct a systematic evaluation of 16 open-source and 5 closed-source models, demonstrating MMReview's strong discriminative power and the substantial challenge it poses to current models. By bridging the gap in multimodal, cross-disciplinary automated review evaluation, MMReview establishes a foundational infrastructure for standardizing and advancing research in AI-assisted scientific peer review.
📝 Abstract
With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
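To make the benchmark's shape concrete, below is a minimal Python sketch of how an MMReview sample and a per-category scoring loop might be organized. Only the counts and the names of the disciplines and task categories come from the abstract; everything else (`PaperSample`, `Reviewer.review`, `score_against_experts`, the field names, the overlap metric) is a hypothetical illustration, not the paper's actual data format, API, or metric.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Protocol


class Discipline(Enum):
    """The four major disciplines named in the abstract."""
    AI = "Artificial Intelligence"
    NATURAL_SCIENCES = "Natural Sciences"
    ENGINEERING_SCIENCES = "Engineering Sciences"
    SOCIAL_SCIENCES = "Social Sciences"


class TaskCategory(Enum):
    """The four core task categories; the 13 individual tasks are
    distributed across these (per-category counts are not given here)."""
    REVIEW_GENERATION = "step-wise review generation"
    OUTCOME_FORMULATION = "outcome formulation"
    PREFERENCE_ALIGNMENT = "alignment with human preferences"
    ROBUSTNESS = "robustness to adversarial input manipulation"


@dataclass
class PaperSample:
    """One of the 240 benchmark papers. All field names are illustrative."""
    paper_id: str
    discipline: Discipline
    subfield: str                                  # one of 17 research domains
    text: str                                      # full paper text
    figures: List[bytes] = field(default_factory=list)       # figure images
    tables: List[str] = field(default_factory=list)          # extracted tables
    expert_reviews: List[str] = field(default_factory=list)  # human reviews


class Reviewer(Protocol):
    """Minimal interface a model under test would need to expose."""
    def review(self, text: str, figures: List[bytes],
               tables: List[str], task: str) -> str: ...


def score_against_experts(prediction: str, expert_reviews: List[str]) -> float:
    # Crude illustrative metric: best unigram Jaccard overlap with any
    # expert review. The real benchmark presumably uses task-specific metrics.
    pred = set(prediction.lower().split())
    best = 0.0
    for ref in expert_reviews:
        r = set(ref.lower().split())
        if pred | r:
            best = max(best, len(pred & r) / len(pred | r))
    return best


def evaluate(model: Reviewer, samples: List[PaperSample],
             category: TaskCategory) -> float:
    """Query the model on every sample for one task category and
    return the mean score."""
    scores = [
        score_against_experts(
            model.review(s.text, s.figures, s.tables, task=category.value),
            s.expert_reviews,
        )
        for s in samples
    ]
    return sum(scores) / len(scores)
```

A real harness would replace the overlap stub with the benchmark's own metrics for each of the four categories (e.g., decision-prediction accuracy or preference-alignment scoring), which this sketch leaves abstracted behind `TaskCategory`.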