J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI evaluation suffers from limited interpretability and high computational cost: conventional reward models yield scalar scores without explanation, while LLM-as-a-Judge offers interpretability at the price of heavy inference overhead. To address this, we propose **Simple Test-Time Scaling (STTS)**, a lightweight test-time extension strategy, and establish, for the first time, that **reinforcement learning (RL) training is the critical phase that enables judge models to benefit from STTS**. Building on this insight, we introduce **J1-7B**, a judge model trained with reflection-enhanced supervised fine-tuning followed by RL with verifiable rewards, and apply STTS at inference. J1-7B delivers verifiable evaluation grounded in explicit reasoning traces. Experiments show that J1-7B outperforms the previous state-of-the-art LLM-as-a-Judge by 4.8% on major benchmarks and exhibits a 5.1% stronger scaling trend under STTS, with the scaling gains attributable primarily to the RL stage rather than supervised fine-tuning, underscoring RL's pivotal role in unlocking scalable, interpretable evaluation.
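The summary does not spell out the STTS mechanics, but "simple test-time scaling" typically refers to s1-style budget forcing: suppress the model's early stop and append a reflection cue (e.g. "Wait") until a minimum thinking budget is spent. Below is a minimal sketch under that assumption; the checkpoint name `J1-7B`, the cue string, and the token budgets are illustrative, not the paper's exact setup.

```python
# Minimal sketch of STTS-style budget forcing for a judge model (assumed
# s1-style recipe, not the paper's confirmed implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "J1-7B"  # hypothetical checkpoint identifier

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

@torch.no_grad()
def judge_with_stts(prompt: str, min_thinking_tokens: int = 1024) -> str:
    """Force the judge to spend at least `min_thinking_tokens` reasoning
    before it is allowed to finalize its verdict."""
    ids = tok(prompt, return_tensors="pt").input_ids
    spent = 0
    while spent < min_thinking_tokens:
        out = model.generate(ids,
                             max_new_tokens=min_thinking_tokens - spent,
                             eos_token_id=tok.eos_token_id)
        spent += out.shape[1] - ids.shape[1]
        ids = out
        if spent < min_thinking_tokens:
            # The model emitted EOS early: drop the stop token and append
            # a reflection cue so it keeps thinking.
            cue = tok("\nWait,", return_tensors="pt",
                      add_special_tokens=False).input_ids
            ids = torch.cat([ids[:, :-1], cue], dim=1)
    # Final unconstrained pass so the judge can state its verdict.
    out = model.generate(ids, max_new_tokens=256)
    return tok.decode(out[0], skip_special_tokens=True)
```

Scaling the thinking budget up or down is then a single-knob way to trade inference cost for judging accuracy, which is the scaling trend the paper measures.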

📝 Abstract
The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, often leaving users uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning traces. In this paper, we introduce **J1-7B**, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection sampling and subsequently trained using Reinforcement Learning (RL) with verifiable rewards. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that **J1-7B** surpasses the previous state-of-the-art LLM-as-a-Judge by **4.8%** and exhibits a **5.1%** stronger scaling trend under STTS. Additionally, we present three key findings: (1) existing LLM-as-a-Judge models do not inherently exhibit such a scaling trend; (2) a model simply fine-tuned on reflection-enhanced datasets continues to demonstrate similarly weak scaling behavior; (3) a significant scaling trend emerges primarily during the RL phase, suggesting that effective STTS capability is acquired predominantly through RL training.
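To make "verifiable rewards" concrete: a common design for judge RL is to grant reward only when the model's final verdict can be parsed and matches a gold preference label, so the signal is checkable without another model in the loop. The sketch below assumes a `[[A]]`/`[[B]]` verdict format; that format is an illustrative assumption, not the paper's specified template.

```python
# Minimal sketch of a verifiable reward for the RL phase (assumed design).
import re

def verifiable_judge_reward(trace: str, gold: str) -> float:
    """Return 1.0 only when the rollout ends in a well-formed verdict
    ("[[A]]" or "[[B]]") that matches the gold preference label;
    anything else, including a malformed verdict, earns 0.0."""
    m = re.search(r"\[\[([AB])\]\]\s*$", trace.strip())
    if m is None:
        return 0.0                      # missing or malformed verdict
    return 1.0 if m.group(1) == gold else 0.0

# Example: a trace that reasons, reflects, and picks response A.
assert verifiable_judge_reward("...therefore A is better. [[A]]", "A") == 1.0
assert verifiable_judge_reward("no verdict given", "A") == 0.0
```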
Problem

Research questions and friction points this paper is trying to address.

Enhancing evaluation quality in AI systems for better interpretability
Scaling test-time computation in LLM-as-a-Judge for performance boost
Improving LLM-as-a-Judge accuracy and interpretability via reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies Simple Test-Time Scaling (STTS) at inference for an additional performance boost
Employs Reinforcement Learning with verifiable rewards
Leverages reflection-enhanced datasets, collected via rejection sampling, for supervised fine-tuning (see the sketch after this list)
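To make the rejection-sampling step concrete, here is a hedged sketch of how the reflection-enhanced SFT set could be collected: sample several reasoning traces per judging prompt and keep only those whose final verdict matches the gold label. The prompt handling, verdict format, and sampling hyperparameters are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of rejection sampling for the reflection-enhanced SFT set,
# reusing the "[[A]]"/"[[B]]" verdict convention assumed above.
import re
import torch

def parse_verdict(trace: str):
    m = re.search(r"\[\[([AB])\]\]\s*$", trace.strip())
    return m.group(1) if m else None

@torch.no_grad()
def collect_reflection_sft(model, tok, prompts, golds, k: int = 8):
    """Sample up to k reasoning traces per judging prompt and keep the
    first one whose final verdict matches the gold label."""
    kept = []
    for prompt, gold in zip(prompts, golds):
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(k):
            out = model.generate(ids, do_sample=True, temperature=0.8,
                                 top_p=0.95, max_new_tokens=2048)
            trace = tok.decode(out[0, ids.shape[1]:],
                               skip_special_tokens=True)
            if parse_verdict(trace) == gold:   # verifiably correct trace
                kept.append({"prompt": prompt, "completion": trace})
                break                          # one verified trace suffices
    return kept
```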
👥 Authors
Chi-Min Chan (HKUST) · Large Language Models, Post-Training, Alignment, LLM Agents
Chunpu Xu (PolyU) · Multimodal Learning, Natural Language Processing
Jiaming Ji (Peking University)
Zhen Ye (Hong Kong University of Science and Technology)
Pengcheng Wen (Hong Kong University of Science and Technology)
Chunyang Jiang (HKUST) · Artificial Intelligence, Natural Language Processing
Yaodong Yang (Peking University)
Wei Xue (Hong Kong University of Science and Technology)
Sirui Han (The Hong Kong University of Science and Technology) · Large Language Models, Interdisciplinary Artificial Intelligence
Yike Guo (Hong Kong University of Science and Technology)