From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

📅 2026-06-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current large language models struggle to produce in-depth, evidence-based peer review comments because they lack the ability to actively investigate a paper. This work proposes ProReviewer, a framework that builds active investigation into LLM-based reviewing by formulating peer review as a Markov decision process. ProReviewer maintains a structured review log as a dynamic workspace for tracking evidence and intermediate findings, which lets it decide what to examine next and generate critiques proactively. Built on an 8B-parameter model trained with supervised fine-tuning and then optimized with reinforcement learning, ProReviewer achieves the highest average score across five quality dimensions, outperforming the strongest fine-tuned baseline by a relative 16% and prompt-based methods that use much larger frontier LLMs by up to 39%. In human evaluation, it also attains the highest win rates against the baselines.

📝 Abstract

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods that use much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% in relative terms. It also attains the highest win rates against baselines in human evaluation.
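The MDP framing in the abstract can be pictured with a small sketch. This is not the authors' implementation: the state layout, the action names (`inspect`, `finish`), the `toy_policy` heuristic standing in for the trained LLM, and the keyword rule that generates findings are all assumptions made for illustration. It only shows the general shape of the loop, in which the state is the paper plus a structured review log and each action is chosen from what the log already contains.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    section: str   # part of the paper that was inspected
    evidence: str  # text collected from that section
    finding: str   # intermediate judgment recorded by the agent


@dataclass
class ReviewState:
    paper: dict                               # section name -> section text
    log: list = field(default_factory=list)   # structured review log
    done: bool = False


FOLLOW_UP = "check "  # hypothetical marker for "look at this section next"


def toy_policy(state):
    """Hypothetical stand-in for the LLM policy: pick an action from the log."""
    visited = {entry.section for entry in state.log}
    # Follow up first on sections that earlier findings flagged as suspicious.
    for entry in state.log:
        if entry.finding.startswith(FOLLOW_UP):
            target = entry.finding[len(FOLLOW_UP):]
            if target in state.paper and target not in visited:
                return ("inspect", target)
    # Otherwise read the next unvisited section in document order.
    for name in state.paper:
        if name not in visited:
            return ("inspect", name)
    return ("finish", "")


def step(state, action):
    """Transition: apply one action and append any new evidence to the log."""
    kind, section = action
    if kind == "finish":
        state.done = True
        return state
    text = state.paper[section]
    # Toy rule (assumption): a claimed gain should be verified in the experiments.
    finding = FOLLOW_UP + "experiments" if "gain" in text else "no issue"
    state.log.append(LogEntry(section, text, finding))
    return state


def run_review(paper, max_steps=10):
    state = ReviewState(paper=paper)
    for _ in range(max_steps):
        if state.done:
            break
        state = step(state, toy_policy(state))
    return state


PAPER = {
    "abstract": "We report a large gain.",
    "method": "We fine-tune a model.",
    "experiments": "Results on one dataset.",
}
final_state = run_review(PAPER)
visit_order = [entry.section for entry in final_state.log]
print(visit_order)
```

In this toy run the agent reads the abstract, logs a finding that flags the experiments, and jumps there before returning to the method section, so the visiting order is driven by accumulated evidence and not by document order. In the system the abstract describes, a trained LLM would play the role of `toy_policy`, and the final review would be written from the log.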

Problem

Research questions and friction points this paper is trying to address.

scientific peer review

proactive investigation

large language models

evidence-based review

review automation

Innovation

Methods, ideas, or system contributions that make the work stand out.

proactive peer review

Markov Decision Process

structured review log