Tuning LLM Judges Hyperparameters

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-as-a-judge frameworks suffer from unstable judge-model performance, high inference cost, and poor reproducibility. Method: This paper introduces the first multi-objective, multi-fidelity hyperparameter search framework explicitly designed for optimizing judge models. It combines Bayesian optimization, LLM-as-a-judge architectures, controllable prompt engineering, and lightweight surrogate modeling to reach Pareto-optimal trade-offs between accuracy and inference cost. All components are implemented with open-weight models to ensure accessibility and full reproducibility. Contribution/Results: The optimized judge models achieve state-of-the-art accuracy across multiple evaluation benchmarks, significantly reduce assessment cost, and demonstrate strong cross-task generalization, without relying on proprietary APIs or closed-source models.
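The "Pareto-optimal trade-offs between accuracy and inference cost" mentioned above can be made concrete with a small sketch: given candidate judge configurations scored on two objectives, keep only those not dominated by any other. The configuration names and numbers below are illustrative, not from the paper.

```python
def pareto_front(candidates):
    """Return the candidates not dominated on (accuracy, cost).

    A candidate is dominated if some other candidate has accuracy at
    least as high AND cost at least as low, with one strict inequality.
    """
    front = []
    for c in candidates:
        dominated = any(
            o["accuracy"] >= c["accuracy"] and o["cost"] <= c["cost"]
            and (o["accuracy"] > c["accuracy"] or o["cost"] < c["cost"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Hypothetical judge configurations (model + prompt variant) with
# made-up accuracy and relative inference-cost numbers.
judges = [
    {"name": "A", "accuracy": 0.82, "cost": 1.0},
    {"name": "B", "accuracy": 0.78, "cost": 0.3},
    {"name": "C", "accuracy": 0.80, "cost": 0.9},
    {"name": "D", "accuracy": 0.70, "cost": 0.5},  # dominated by B
]

front = pareto_front(judges)  # A, B, and C survive; D is dominated
```

A multi-objective search returns this whole front rather than a single winner, so users can pick a cheap judge or an accurate one depending on their budget.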

📝 Abstract
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present across different papers: for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which finds judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also use open-weight models, ensuring greater accessibility and reproducibility.
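The multi-fidelity idea in the abstract, evaluating many configurations cheaply and spending full evaluation budget only on promising ones, can be sketched with successive halving. Everything here (configuration names, scores, the deterministic `evaluate` stand-in) is illustrative, not the paper's actual procedure.

```python
def successive_halving(configs, evaluate, budgets=(50, 200, 800), keep=0.5):
    """Score every config at the smallest budget, keep the top
    fraction, and re-evaluate only the survivors at larger budgets."""
    survivors = list(configs)
    for budget in budgets:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget),
                        reverse=True)
        survivors = scored[:max(1, int(len(scored) * keep))]
    return survivors

# Hypothetical judge configurations with made-up agreement scores.
configs = [
    {"id": "llama-cot", "true_score": 0.86},
    {"id": "llama-plain", "true_score": 0.74},
    {"id": "qwen-cot", "true_score": 0.81},
    {"id": "qwen-plain", "true_score": 0.69},
]

def evaluate(config, budget):
    # Stand-in for running the judge on `budget` comparison examples
    # and measuring agreement with human labels; deterministic here
    # for simplicity (real evaluations are noisy at low budgets).
    return config["true_score"]

best = successive_halving(configs, evaluate)  # only "llama-cot" survives
```

Because weak configurations are discarded after the cheapest round, most of the evaluation budget is spent on the few judges that matter, which is how the search cost stays low.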
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Automated Evaluation
Cost-Effectiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-objective optimization
Multi-fidelity approach
Hyperparameter tuning