J1: Exploring Simple Test-Time Scaling for LLM-as-a-Judge

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI evaluation suffers from limited interpretability and high computational cost: conventional reward models yield scalar scores without explanation, while LLM-as-a-Judge offers interpretability at the price of heavy inference overhead. To address this, we propose **Simple Test-Time Scaling (STTS)**, a lightweight test-time extension strategy, and establish, for the first time, that **reinforcement learning (RL) training is the critical phase that enables judge models to benefit from STTS**. Building on this insight, we introduce **J1-7B**, a judge model trained with reflection-enhanced supervised fine-tuning followed by RL with verifiable rewards, and apply STTS at inference. J1-7B delivers verifiable evaluation grounded in explicit reasoning traces. Experiments show that J1-7B outperforms the previous state-of-the-art LLM-as-a-Judge by 4.8% on major benchmarks and exhibits a 5.1% stronger scaling trend under STTS, with the scaling gains attributable primarily to the RL stage rather than supervised fine-tuning, underscoring RL's pivotal role in unlocking scalable, interpretable evaluation.
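The summary does not spell out the STTS mechanics, but "simple test-time scaling" typically refers to s1-style budget forcing: suppress the model's early stop and append a reflection cue (e.g. "Wait") until a minimum thinking budget is spent. Below is a minimal sketch under that assumption; the checkpoint name `J1-7B`, the cue string, and the token budgets are illustrative, not the paper's exact setup.

```python
# Minimal sketch of STTS-style budget forcing for a judge model (assumed
# s1-style recipe, not the paper's confirmed implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "J1-7B"  # hypothetical checkpoint identifier

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

@torch.no_grad()
def judge_with_stts(prompt: str, min_thinking_tokens: int = 1024) -> str:
    """Force the judge to spend at least `min_thinking_tokens` reasoning
    before it is allowed to finalize its verdict."""
    ids = tok(prompt, return_tensors="pt").input_ids
    spent = 0
    while spent < min_thinking_tokens:
        out = model.generate(ids,
                             max_new_tokens=min_thinking_tokens - spent,
                             eos_token_id=tok.eos_token_id)
        spent += out.shape[1] - ids.shape[1]
        ids = out
        if spent < min_thinking_tokens:
            # The model emitted EOS early: drop the stop token and append
            # a reflection cue so it keeps thinking.
            cue = tok("\nWait,", return_tensors="pt",
                      add_special_tokens=False).input_ids
            ids = torch.cat([ids[:, :-1], cue], dim=1)
    # Final unconstrained pass so the judge can state its verdict.
    out = model.generate(ids, max_new_tokens=256)
    return tok.decode(out[0], skip_special_tokens=True)
```

Scaling the thinking budget up or down is then a single-knob way to trade inference cost for judging accuracy, which is the scaling trend the paper measures.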

📝 Abstract
The current focus of AI research is shifting from emphasizing model training towards enhancing evaluation quality, a transition that is crucial for driving further advancements in AI systems. Traditional evaluation methods typically rely on reward models assigning scalar preference scores to outputs. Although effective, such approaches lack interpretability, often leaving users uncertain about why a reward model rates a particular response as high or low. The advent of LLM-as-a-Judge provides a more scalable and interpretable method of supervision, offering insights into the decision-making process. Moreover, with the emergence of large reasoning models, which consume more tokens for deeper thinking and answer refinement, scaling test-time computation in the LLM-as-a-Judge paradigm presents an avenue for further boosting performance and providing more interpretability through reasoning traces. In this paper, we introduce **J1-7B**, which is first supervised fine-tuned on reflection-enhanced datasets collected via rejection sampling and subsequently trained using Reinforcement Learning (RL) with verifiable rewards. At inference time, we apply Simple Test-Time Scaling (STTS) strategies for additional performance improvement. Experimental results demonstrate that **J1-7B** surpasses the previous state-of-the-art LLM-as-a-Judge by **4.8%** and exhibits a **5.1%** stronger scaling trend under STTS. Additionally, we present three key findings: (1) existing LLM-as-a-Judge models do not inherently exhibit such a scaling trend; (2) a model simply fine-tuned on reflection-enhanced datasets continues to demonstrate similarly weak scaling behavior; (3) a significant scaling trend emerges primarily during the RL phase, suggesting that effective STTS capability is acquired predominantly through RL training.
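To make "verifiable rewards" concrete: a common design for judge RL is to grant reward only when the model's final verdict can be parsed and matches a gold preference label, so the signal is checkable without another model in the loop. The sketch below assumes a `[[A]]`/`[[B]]` verdict format; that format is an illustrative assumption, not the paper's specified template.

```python
# Minimal sketch of a verifiable reward for the RL phase (assumed design).
import re

def verifiable_judge_reward(trace: str, gold: str) -> float:
    """Return 1.0 only when the rollout ends in a well-formed verdict
    ("[[A]]" or "[[B]]") that matches the gold preference label;
    anything else, including a malformed verdict, earns 0.0."""
    m = re.search(r"\[\[([AB])\]\]\s*$", trace.strip())
    if m is None:
        return 0.0                      # missing or malformed verdict
    return 1.0 if m.group(1) == gold else 0.0

# Example: a trace that reasons, reflects, and picks response A.
assert verifiable_judge_reward("...therefore A is better. [[A]]", "A") == 1.0
assert verifiable_judge_reward("no verdict given", "A") == 0.0
```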
Problem

Research questions and friction points this paper is trying to address.

Enhancing evaluation quality in AI systems for better interpretability
Scaling test-time computation in LLM-as-a-Judge for performance boost
Improving LLM-as-a-Judge accuracy and interpretability via reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies Simple Test-Time Scaling (STTS) at inference for an additional performance boost
Employs Reinforcement Learning with verifiable rewards
Leverages reflection-enhanced datasets, collected via rejection sampling, for supervised fine-tuning (see the sketch after this list)
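To make the rejection-sampling step concrete, here is a hedged sketch of how the reflection-enhanced SFT set could be collected: sample several reasoning traces per judging prompt and keep only those whose final verdict matches the gold label. The prompt handling, verdict format, and sampling hyperparameters are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of rejection sampling for the reflection-enhanced SFT set,
# reusing the "[[A]]"/"[[B]]" verdict convention assumed above.
import re
import torch

def parse_verdict(trace: str):
    m = re.search(r"\[\[([AB])\]\]\s*$", trace.strip())
    return m.group(1) if m else None

@torch.no_grad()
def collect_reflection_sft(model, tok, prompts, golds, k: int = 8):
    """Sample up to k reasoning traces per judging prompt and keep the
    first one whose final verdict matches the gold label."""
    kept = []
    for prompt, gold in zip(prompts, golds):
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(k):
            out = model.generate(ids, do_sample=True, temperature=0.8,
                                 top_p=0.95, max_new_tokens=2048)
            trace = tok.decode(out[0, ids.shape[1]:],
                               skip_special_tokens=True)
            if parse_verdict(trace) == gold:   # verifiably correct trace
                kept.append({"prompt": prompt, "completion": trace})
                break                          # one verified trace suffices
    return kept
```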
👥 Authors
Chi-Min Chan (HKUST) · Large Language Models, Post-Training, Alignment, LLM Agents
Chunpu Xu (PolyU) · Multimodal Learning, Natural Language Processing
Jiaming Ji (Peking University)
Zhen Ye (Hong Kong University of Science and Technology)
Pengcheng Wen (Hong Kong University of Science and Technology)
Chunyang Jiang (HKUST) · Artificial Intelligence, Natural Language Processing
Yaodong Yang (Peking University)
Wei Xue (Hong Kong University of Science and Technology)
Sirui Han (The Hong Kong University of Science and Technology) · Large Language Models, Interdisciplinary Artificial Intelligence
Yike Guo (Hong Kong University of Science and Technology)