MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

📅 2026-06-04

🤖 AI Summary

This work addresses three weaknesses of large language models in analytic essay scoring: susceptibility to bias, unstable scores, and heavy reliance on annotated data. It proposes a training-free evaluation framework that integrates multi-agent debate with rubric-based retrieval-augmented generation (RAG). The framework employs an Advocate to highlight strengths, a Skeptic to identify weaknesses, and a Judge that synthesizes their arguments while referencing retrieved scored exemplars. Through multi-agent reasoning, prompt engineering, and exemplar-based calibration, the system substantially outperforms conventional prompting baselines without any fine-tuning and approaches the performance of supervised models. Retrieval drives the calibration gains, while debate improves reasoning about higher-order writing traits.

📝 Abstract

We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.
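
The Advocate, Skeptic, and Judge flow described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `LLM` callable type, the `Exemplar` record, the prompt wording, the `SCORE: <integer>` reply format, and the Jaccard token-overlap retriever, which stands in for whatever rubric-aligned retrieval MADRAG actually uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set

# Hypothetical interface: any function that maps a prompt to a model reply.
LLM = Callable[[str], str]


@dataclass
class Exemplar:
    """A previously scored essay for one rubric trait (hypothetical schema)."""
    text: str
    trait: str
    score: int


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens, used only by the toy retriever below."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(essay: str, trait: str, pool: List[Exemplar], k: int = 2) -> List[Exemplar]:
    """Return the k same-trait exemplars with the highest Jaccard overlap.

    Assumption: lexical overlap is a placeholder for the paper's retriever.
    """
    query = _tokens(essay)

    def similarity(ex: Exemplar) -> float:
        candidate = _tokens(ex.text)
        union = query | candidate
        return len(query & candidate) / len(union) if union else 0.0

    same_trait = [ex for ex in pool if ex.trait == trait]
    return sorted(same_trait, key=similarity, reverse=True)[:k]


def score_trait(essay: str, trait: str, rubric: str,
                pool: List[Exemplar], llm: LLM, k: int = 2) -> int:
    """Run one Advocate -> Skeptic -> Judge round and return the Judge's score."""
    context = f"Trait: {trait}\nRubric: {rubric}\nEssay: {essay}\n"

    # The Advocate argues for the essay's strengths on this trait.
    advocate = llm(f"ROLE: Advocate\n{context}List the essay's strengths.")

    # The Skeptic sees the Advocate's case and argues the weaknesses.
    skeptic = llm(
        f"ROLE: Skeptic\n{context}Advocate's case: {advocate}\n"
        "List the essay's weaknesses."
    )

    # Only the Judge receives the retrieved scored exemplars for calibration.
    shots = "\n".join(
        f"[score {ex.score}] {ex.text}" for ex in retrieve(essay, trait, pool, k)
    )
    verdict = llm(
        f"ROLE: Judge\n{context}Scored exemplars:\n{shots}\n"
        f"Advocate: {advocate}\nSkeptic: {skeptic}\n"
        "Reply with 'SCORE: <integer>'."
    )

    match = re.search(r"SCORE:\s*(-?\d+)", verdict)
    if match is None:
        raise ValueError("Judge reply did not contain a parsable score")
    return int(match.group(1))
```

In this sketch, `score_trait` would be called once per rubric trait, with `llm` bound to whichever model client is in use. Passing a stub function for `llm` is enough to exercise the control flow without any model access.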

Problem

Research questions and friction points this paper is trying to address.

analytic essay scoring

LLM-as-judge

bias

scoring instability

training-free evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent debate

retrieval-augmented generation

training-free evaluation