🤖 AI Summary
Large language models (LLMs) exhibit high optimism bias, poor robustness, and opaque decision-making when used to automatically evaluate scientific question-answering (QA) tasks. To address these issues, this paper proposes YESciEval, a novel framework for transparent, robust, and cost-free automatic evaluation. Methodologically, it introduces the first rubric-guided reinforcement learning paradigm for scientific QA assessment, combining fine-grained rubric modeling, RL-based fine-tuning, adversarial sample construction, and multi-LLM cross-validation, entirely without human feedback or proprietary models. Key contributions include: (1) releasing the first interdisciplinary scientific QA benchmark with adversarial examples; (2) substantially reducing optimism bias while achieving cost-free evaluation with high inter-annotator agreement (Cohen’s κ > 0.85) and strong generalization across diverse LLM evaluators; and (3) open-sourcing a fully reproducible, extensible, and transparent framework that establishes a new paradigm for scientific evaluation.
📝 Abstract
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.