Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the effectiveness of LLM-as-judges in test-time scaling (TTS). To this end, we introduce JETTS—the first benchmark covering response re-ranking, step-level beam search, and critique-driven refinement—across mathematical reasoning, code generation, and instruction following. Experiments pair 10 LLM-judges (7B–70B parameters) with 8 generator models (6.7B–72B parameters). Results reveal that while LLM-judges match outcome-based reward models in re-ranking, they underperform process-based reward models in step-level beam search; moreover, their natural-language critiques yield no measurable improvement in generation quality. The core contribution is the rigorous delineation of LLM-judges’ capability boundaries in TTS, demonstrating that effective guidance requires explicit process modeling and structured feedback—not merely scalar or textual judgments.

📝 Abstract
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B–70B parameters) for 8 different base generator models (6.7B–72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM-judges' effectiveness in test-time scaling settings
Comparing judge models with reward models in different tasks
Assessing natural language critiques' impact on response refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM-judges for test-time scaling evaluation
Introduces JETTS benchmark across three domains
Compares judge performance against outcome- and process-based reward models
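To make the first of the three task settings concrete, below is a minimal sketch of judge-based best-of-N response reranking. The `judge_score` and `best_of_n` functions are hypothetical placeholders standing in for real model calls, not the paper's actual interfaces; a real LLM-judge would emit a natural-language critique plus a score rather than the toy heuristic used here.

```python
# Sketch of best-of-N reranking with an LLM-judge (one of the three JETTS
# task settings). Names here are illustrative placeholders, not JETTS APIs.

def judge_score(prompt: str, response: str) -> float:
    """Placeholder judge: a real LLM-judge would produce a critique and a
    scalar quality score for (prompt, response). Here we use a toy proxy:
    the number of distinct tokens in the response."""
    return float(len(set(response.split())))

def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Rerank N sampled candidate responses and return the judge's top pick."""
    return max(candidates, key=lambda r: judge_score(prompt, r))

candidates = [
    "4",
    "2 + 2 = 4",
    "The answer is 4 because 2 + 2 = 4.",
]
print(best_of_n("What is 2 + 2?", candidates))
```

In the benchmark's reranking setting, the generator samples N responses and the judge's scores select one; the paper's finding is that judges are competitive with outcome reward models precisely in this scalar-selection role.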