Tuning LLM Judges Hyperparameters

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-as-a-judge frameworks suffer from unstable judge-model performance, high inference cost, and poor reproducibility. Method: This paper introduces the first multi-objective, multi-fidelity hyperparameter search framework explicitly designed for optimizing judge models. It combines Bayesian optimization, LLM-as-a-judge architectures, controllable prompt engineering, and lightweight surrogate modeling to reach Pareto-optimal trade-offs between accuracy and inference cost. All components are implemented with open-weight models to ensure accessibility and full reproducibility. Contribution/Results: The optimized judge models achieve state-of-the-art accuracy across multiple evaluation benchmarks, significantly reduce assessment cost, and demonstrate strong cross-task generalization, without relying on proprietary APIs or closed-source models.
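The "Pareto-optimal trade-offs between accuracy and inference cost" mentioned above can be made concrete with a small sketch: given candidate judge configurations scored on two objectives, keep only those not dominated by any other. The configuration names and numbers below are illustrative, not from the paper.

```python
def pareto_front(candidates):
    """Return the candidates not dominated on (accuracy, cost).

    A candidate is dominated if some other candidate has accuracy at
    least as high AND cost at least as low, with one strict inequality.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"] and o["cost"] <= c["cost"]
            and (o["accuracy"] > c["accuracy"] or o["cost"] < c["cost"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical judge configurations (model + prompt variant) with
# made-up accuracy and relative inference-cost numbers.
judges = [
    {"name": "A", "accuracy": 0.82, "cost": 1.0},
    {"name": "B", "accuracy": 0.78, "cost": 0.3},
    {"name": "C", "accuracy": 0.80, "cost": 0.9},
    {"name": "D", "accuracy": 0.70, "cost": 0.5},  # dominated by B
]

front = pareto_front(judges)  # A, B, and C survive; D is dominated
```

A multi-objective search returns this whole front rather than a single winner, so users can pick a cheap judge or an accurate one depending on their budget.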

📝 Abstract
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present across different papers: for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which finds judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also use open-weight models, ensuring greater accessibility and reproducibility.
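The multi-fidelity idea in the abstract, evaluating many configurations cheaply and spending full evaluation budget only on promising ones, can be sketched with successive halving. Everything here (configuration names, scores, the deterministic `evaluate` stand-in) is illustrative, not the paper's actual procedure.

```python
def successive_halving(configs, evaluate, budgets=(50, 200, 800), keep=0.5):
    """Score every config at the smallest budget, keep the top
    fraction, and re-evaluate only the survivors at larger budgets."""
    survivors = list(configs)
    for budget in budgets:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget),
                        reverse=True)
        survivors = scored[:max(1, int(len(scored) * keep))]
    return survivors

# Hypothetical judge configurations with made-up agreement scores.
configs = [
    {"id": "llama-cot", "true_score": 0.86},
    {"id": "llama-plain", "true_score": 0.74},
    {"id": "qwen-cot", "true_score": 0.81},
    {"id": "qwen-plain", "true_score": 0.69},
]

def evaluate(config, budget):
    # Stand-in for running the judge on `budget` comparison examples
    # and measuring agreement with human labels; deterministic here
    # for simplicity (real evaluations are noisy at low budgets).
    return config["true_score"]

best = successive_halving(configs, evaluate)  # only "llama-cot" survives
```

Because weak configurations are discarded after the cheapest round, most of the evaluation budget is spent on the few judges that matter, which is how the search cost stays low.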
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Automated Evaluation
Cost-Effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-objective optimization
Multi-fidelity approach
Hyperparameter tuning