SLMJury: Can Small Language Models Judge as Well as Large Ones?

📅 2026-06-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scalability limits of using large language models (LLMs) as automated evaluators, namely high cost, latency, and opacity, and proposes SLMJury, a framework for systematically evaluating small language models (SLMs) with 0.6B–14B parameters as judges on both closed-ended binary correctness and open-ended quality scoring. The study benchmarks 16 SLM judges from four model families on eight closed-ended benchmarks spanning mathematical, scientific, and general reasoning, plus SummEval and MT-Bench, and combines a budget-conditioned formalization of judging, the Reflect-Critique-Refine (RCR) multi-agent debate protocol, and adversarial persona testing. It finds that the “overthinking” effect is domain-dependent and that domain generalization varies widely across model families. For most judges, quick 10-token verdicts match or beat extended reasoning on mathematical judging, whereas extended reasoning wins on general tasks; closed-ended and open-ended judging rank models differently; multi-agent debate under RCR degrades accuracy in every tested configuration; and the top judges resist six adversarial personas with variance of at most 0.55%. Together these results indicate that reliable automated evaluation need not rely on large proprietary models, although no single SLM dominates.

📝 Abstract

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges, quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with ≤0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.
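The comparisons in the abstract (short versus long token budgets, and stability across adversarial personas) can be illustrated with a minimal sketch. This is not the SLMJury package's API: the function names `accuracy`, `budget_gap`, and `persona_spread`, the budget keys, the persona names, and the toy verdicts are all hypothetical. The abstract does not say how its variance figure is computed, so the sketch assumes one possible reading, the max-minus-min accuracy range in percentage points.

```python
# Hypothetical illustration only; not the SLMJury API and not the paper's data.

def accuracy(verdicts, labels):
    """Fraction of binary verdicts (0/1) that agree with the gold labels."""
    if len(verdicts) != len(labels):
        raise ValueError("verdicts and labels must have the same length")
    return sum(v == g for v, g in zip(verdicts, labels)) / len(labels)

def budget_gap(verdicts_by_budget, labels, short, long):
    """Accuracy at the short token budget minus accuracy at the long one.

    A positive value means the quick verdict beat extended reasoning.
    """
    return (accuracy(verdicts_by_budget[short], labels)
            - accuracy(verdicts_by_budget[long], labels))

def persona_spread(verdicts_by_persona, labels):
    """Max-minus-min accuracy across personas, in percentage points.

    Assumed reading of "variance"; the paper may define it differently.
    """
    accs = [accuracy(v, labels) for v in verdicts_by_persona.values()]
    return 100 * (max(accs) - min(accs))

if __name__ == "__main__":
    gold = [1, 0, 1, 1, 0, 1, 0, 0]  # toy gold correctness labels
    by_budget = {
        10: [1, 0, 1, 1, 0, 1, 0, 1],    # toy verdicts at a 10-token budget
        512: [1, 0, 0, 1, 0, 1, 1, 1],   # toy verdicts at a longer budget
    }
    by_persona = {
        "neutral": [1, 0, 1, 1, 0, 1, 0, 1],
        "skeptic": [1, 0, 1, 1, 0, 1, 1, 1],
    }
    print(budget_gap(by_budget, gold, short=10, long=512))
    print(persona_spread(by_persona, gold))
```

Keeping the budget as an explicit key makes the overthinking comparison a simple difference of two accuracies on the same items, which is the quantity the first finding reports per domain.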

Problem

Research questions and friction points this paper is trying to address.

small language models

automated evaluation

model judging

binary correctness

quality scoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

Small Language Models

Automated Evaluation

Judging Framework