🤖 AI Summary
Existing large language model (LLM) evaluation methods rely heavily on domain experts, manual annotation, and labeled datasets, which limits their scalability and introduces subjectivity.
Method: This paper introduces SKATE, a fully automated evaluation framework that requires no human input, domain expertise, or pre-existing data. In SKATE, models compete by generating and solving verifiable tasks for one another (e.g., code-output prediction), so evaluation is treated as a game in which each model acts as both task-setter and solver. Models are incentivized to pose questions that highlight their own strengths and expose others' weaknesses, and a TrueSkill-based ranking system turns the results into model rankings.
Contribution/Results: Experiments on six frontier LLMs show that (1) weaker models can reliably differentiate and score stronger ones, (2) models exhibit self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Because tasks are verifiable rather than graded by LLM judges, scoring is objective and needs no human supervision.
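To make "verifiable task" concrete, here is a minimal sketch of how a code-output-prediction answer could be checked by execution instead of by a judge. It is an illustration under stated assumptions, not the paper's harness: the names `run_snippet`, `score_prediction`, and `task_code` are hypothetical, the task language is assumed to be Python, and a real setup would need proper sandboxing for model-written code.

```python
import subprocess
import sys


def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Run a Python snippet in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout


def score_prediction(code: str, predicted_output: str) -> bool:
    """Return True when the prediction matches the real output.

    Comparison ignores leading/trailing whitespace (an assumption; a
    stricter or looser matching rule is equally plausible).
    """
    return run_snippet(code).strip() == predicted_output.strip()


# Hypothetical task that a setter model might pose to a solver model.
task_code = "print(sum(i * i for i in range(4)))"

if __name__ == "__main__":
    print(score_prediction(task_code, "14"))  # solver's guess is checked by execution
```

The ground truth comes from running the setter's code, so no human label or LLM judge is involved in scoring.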
📝 Abstract
Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.
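The abstract does not say how task outcomes are fed into the TrueSkill-based ranking, so the following is only a rough sketch of the general idea. It assumes a pairwise outcome (for example, one model answers a question correctly and the other does not) and applies a simplified two-player TrueSkill-style update with no draws and no skill-drift term. The `Rating` class, the `update` function, and the default constants are illustrative choices, not the paper's configuration.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for pdf/cdf
BETA = 25.0 / 6.0           # assumed performance noise (common TrueSkill default)


@dataclass(frozen=True)
class Rating:
    mu: float = 25.0            # estimated skill
    sigma: float = 25.0 / 3.0   # uncertainty about that estimate


def update(winner: Rating, loser: Rating, beta: float = BETA) -> tuple[Rating, Rating]:
    """Simplified two-player TrueSkill-style update (win/loss only)."""
    c = math.sqrt(2 * beta**2 + winner.sigma**2 + loser.sigma**2)
    t = (winner.mu - loser.mu) / c
    # v shifts the means, w shrinks the variances; both depend on how
    # surprising the outcome was given the current ratings.
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)
    w = v * (v + t)

    def shrink(r: Rating) -> float:
        return r.sigma * math.sqrt(1 - (r.sigma**2 / c**2) * w)

    new_winner = Rating(winner.mu + (winner.sigma**2 / c) * v, shrink(winner))
    new_loser = Rating(loser.mu - (loser.sigma**2 / c) * v, shrink(loser))
    return new_winner, new_loser


if __name__ == "__main__":
    a, b = update(Rating(), Rating())  # hypothetical: model A beats model B once
    print(a, b)
```

An upset (a low-rated model beating a high-rated one) moves the means more than an expected result does, which is what lets repeated pairwise outcomes separate models whose skills are close.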