Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current AI-assisted peer review systems are vulnerable to targeted adversarial attacks when processing multimodal scientific papers containing both figures and text, yet existing robustness research remains confined to purely textual inputs and lacks effective defense mechanisms. This work proposes the first comprehensive adversarial attack and defense framework tailored for multimodal peer review, introducing PaperGuard—a benchmark suite comprising a multimodal review dataset, a cross-modal attack toolkit that integrates black-box prompt injection with white-box image perturbations, and an efficient defense strategy based on chunked embedding retrieval. Experiments demonstrate widespread vulnerabilities in mainstream AI review models, and PaperGuard establishes a foundational benchmark, protocol, and practical defense infrastructure for trustworthy, attack-resilient AI-assisted academic peer review.

📝 Abstract

The integration of Large Language Models (LLMs) and Multimodal LLMs (MLLMs) into scientific peer-review workflows introduces novel and significant risks for adversarial manipulation, especially given the multimodal nature of scientific papers where figures, not just text, convey core evidence. This creates a significant gap: current robustness studies on AI peer-review are overwhelmingly text-only. Moreover, the problem is distinct from standard jailbreaking, as a peer-review attack seeks to induce a domain-specific, targeted failure (e.g., "inflate this score") rather than a general safety policy violation, for which no practical defenses exist. To address this, we introduce PaperGuard, the first comprehensive benchmark designed to systematically evaluate and defend AI-generated peer-review against these domain-specific, cross-modal attacks. Our framework is built on three pillars: (1) a new multimodal peer-review dataset spanning multiple scientific domains; (2) a unified suite of attacks, including black-box prompt injections and white-box perturbations, specifically designed to target both text (GCG) and figures (PGD); and (3) a practical defense, motivated by the long-context challenge of academic papers, that uses chunk-based embedding search to efficiently localize and mitigate harmful instructions. Our extensive experiments, conducted across state-of-the-art models, confirm that AI reviewers are pervasively vulnerable. PaperGuard establishes the foundational benchmark, protocols, and actionable defense necessary to pioneer trustworthy, attack-resilient AI-assisted scholarly reviewing.

Problem

Research questions and friction points this paper is trying to address.

multimodal peer review

adversarial attacks

AI reviewer robustness

domain-specific manipulation

scientific publishing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Peer Review

Adversarial Attacks

PaperGuard