🤖 AI Summary
Current large language models struggle to generate in-depth, evidence-based peer reviews because they cannot proactively investigate suspicious parts of a paper. This work proposes ProReviewer, a review agent that formulates peer review as a Markov decision process (MDP) and integrates an active investigation mechanism into the LLM-based reviewing process. ProReviewer maintains a structured review log as a dynamic workspace for tracking evidence and intermediate findings, and this log guides what the agent examines next. Built on an 8B-parameter backbone trained with supervised fine-tuning and then optimized with reinforcement learning, ProReviewer achieves the highest average score across five quality dimensions, outperforming the strongest fine-tuned baseline by a relative 16% and prompt-based methods built on much larger frontier LLMs by up to 39%. In human evaluation, it also attains the highest win rates against the baselines.
📝 Abstract
Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is their lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and we propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a structured review log that it maintains throughout the review. The log serves as a workspace in which the agent tracks the evidence and intermediate findings it collects. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and then optimized by reinforcement learning, achieves the highest average score across five quality dimensions, with relative improvements of up to 39% over prompt-based methods that use much larger frontier LLMs and of 16% over the strongest fine-tuned baseline. It also attains the highest win rates against baselines in human evaluation.
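
To make the MDP view concrete, the following is a minimal sketch, not the authors' implementation. It assumes the state is the structured review log, an action is either inspecting a section or finishing, and a transition updates the log with whatever the inspection turns up. Every name here (`ReviewLog`, `policy`, `step`, `review`) is hypothetical, and the keyword rules stand in for the LLM that would make these decisions in a real agent.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewLog:
    """Hypothetical structured review log; plays the role of the MDP state."""
    evidence: list = field(default_factory=list)  # (section, text) pairs
    findings: list = field(default_factory=list)  # intermediate conclusions
    leads: list = field(default_factory=list)     # sections flagged for follow-up
    visited: list = field(default_factory=list)   # sections inspected, in order


def policy(log, paper):
    """Toy stand-in for the LLM policy: map the current log to the next action."""
    # Proactive part: follow up on sections that earlier evidence pointed to.
    for section in log.leads:
        if section not in log.visited:
            return ("inspect", section)
    # Otherwise fall back to reading the paper in document order.
    for section in paper:
        if section not in log.visited:
            return ("inspect", section)
    return ("finish", None)


def step(log, paper, action):
    """Toy transition: inspecting a section writes evidence and leads to the log."""
    _, section = action
    text = paper[section]
    log.visited.append(section)
    # Assumed rule: the word "unverified" marks a claim worth criticizing.
    if "unverified" in text:
        log.evidence.append((section, text))
        log.findings.append(f"Claim in {section} lacks support")
    # Assumed rule: "see <section>" is a pointer that becomes a lead.
    for other in paper:
        if other != section and f"see {other}" in text:
            log.leads.append(other)
    return log


def review(paper, max_steps=20):
    """Run the inspect/update loop until the policy finishes or the budget ends."""
    log = ReviewLog()
    for _ in range(max_steps):
        action = policy(log, paper)
        if action[0] == "finish":
            break
        log = step(log, paper, action)
    return log
```

For example, calling `review` on a toy paper whose `intro` text contains "see experiments" makes the agent jump to `experiments` before `method`, because the lead recorded in the log takes priority over document order. That log-driven reordering is the behavior the abstract describes as proactive investigation; how ProReviewer actually represents its log, actions, and rewards is not specified in this section.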