YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit high optimism bias, poor robustness, and opaque decision-making when used to automatically evaluate scientific question-answering (QA). To address these issues, this paper proposes YESciEval, a framework for transparent, robust, and cost-free automatic evaluation. Methodologically, it introduces the first rubric-guided reinforcement learning paradigm for scientific QA assessment, combining fine-grained rubric modeling, RL-based fine-tuning, adversarial sample construction, and multi-LLM cross-validation, all without human feedback or proprietary models. Key contributions include: (1) releasing the first interdisciplinary scientific QA benchmark with adversarial examples; (2) substantially reducing optimism bias while achieving high inter-annotator agreement (Cohen's κ > 0.85) at zero evaluation cost, with strong generalization across diverse LLM evaluators; and (3) open-sourcing a fully reproducible, extensible, and transparent framework that establishes a new paradigm for scientific evaluation.
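For context on the agreement figure above: Cohen's κ measures how much two raters (here, two LLM judges) agree beyond what chance alone would produce. A minimal sketch, with illustrative ratings that are not from the paper:

```python
# Minimal sketch: Cohen's kappa between two judges' Likert ratings.
# The rating lists below are illustrative, not data from the paper.
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: fraction of items rated identically.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement under independence of the two raters.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

judge_a = [5, 4, 5, 3, 5, 4, 5, 5]
judge_b = [5, 4, 5, 3, 4, 4, 5, 5]
print(round(cohens_kappa(judge_a, judge_b), 3))  # → 0.784
```

Values above roughly 0.8, like the κ > 0.85 the summary reports, are conventionally read as near-perfect agreement.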

📝 Abstract
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry and artificial general intelligence.
Problem

Research questions and friction points this paper is trying to address.

Evaluating robustness of LLMs in scientific question-answering
Mitigating optimism bias in LLM evaluators via reinforcement learning
Providing scalable, cost-free evaluation independent of proprietary models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained rubric-based assessment with reinforcement learning
Multidisciplinary science Q&A datasets with adversarial variants
Scalable cost-free evaluation without proprietary models
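The rubric-based assessment listed above could be sketched as a per-criterion judging loop. The rubric names and the stubbed judge below are hypothetical illustrations; the paper's actual rubric and fine-tuned evaluator models may differ.

```python
# Hypothetical sketch of fine-grained rubric-based LLM-as-a-judge scoring.
# Rubric criteria and the stubbed judge are illustrative assumptions.

RUBRIC = {
    "correctness": "Is the answer factually consistent with the cited sources?",
    "completeness": "Does the answer cover the key points of the question?",
    "clarity": "Is the answer well-structured and readable?",
}

def judge_stub(question, answer, criterion, description):
    # Placeholder for an LLM call returning a 1-5 Likert score.
    # A real pipeline would prompt an open, RL-fine-tuned judge model
    # with the criterion description, avoiding proprietary APIs.
    return 3

def score_answer(question, answer, judge=judge_stub):
    """Score one answer on every rubric criterion separately."""
    return {c: judge(question, answer, c, d) for c, d in RUBRIC.items()}

scores = score_answer("What causes auroras?", "Charged solar particles ...")
print(scores)  # one 1-5 score per criterion
```

Scoring each criterion separately, rather than asking for a single overall grade, is what makes the assessment fine-grained and lets adversarial variants expose which specific quality dimension a judge fails to penalize.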