Do Before You Judge: Self-Reference as a Pathway to Better LLM Evaluation

📅 2025-09-24
🤖 AI Summary
Problem: While the LLM-as-Judge paradigm is increasingly adopted for automated evaluation, the relationship between a language model’s generative capability and its judging capability remains poorly understood, with conflicting empirical findings. Method: Through systematic experiments across 11 models and 21 diverse tasks, we first establish, at both the dataset and instance levels, that generative and judging abilities are only weakly correlated, primarily because models are sensitive to the responses being judged. To address this, we propose “self-referential evaluation”: a paradigm in which the model’s own generated response serves as the reference in the judging prompt. Contribution/Results: This approach substantially strengthens the correlation between generation and judgment (average +0.42 Pearson coefficient), yielding a reliable proxy metric for model selection. Extensive multi-task evaluations confirm its effectiveness and generalizability.

📝 Abstract
LLM-as-Judge frameworks are increasingly popular for AI evaluation, yet research findings on the relationship between models' generation and judgment abilities remain inconsistent. We investigate this relationship through systematic dataset- and instance-level analyses across 11 models and 21 diverse tasks. Despite both capabilities relying on the same underlying knowledge, our analyses reveal they are only weakly correlated, primarily due to LLMs' sensitivity to the responses being judged. To address this, we propose a self-reference-guided evaluation strategy that leverages a model's own answers as references. This approach significantly strengthens the correlation between generation and judgment abilities, offering a practical path to align these skills and providing a reliable proxy for model selection in evaluation tasks.
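The correlation analysis described in the abstract can be illustrated with a minimal sketch. It assumes each model has one generation score and one judgment score per dataset; the function name `pearson` and the toy numbers are illustrative placeholders, not values or code from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Toy placeholder scores (NOT results from the paper): one entry per model.
generation_acc = [0.62, 0.71, 0.55, 0.80]
judgment_acc = [0.70, 0.66, 0.69, 0.74]
r = pearson(generation_acc, judgment_acc)
```

A value of `r` near 1 would mean that models that generate well also judge well on that dataset; the abstract reports that this relationship is only weak without self-reference.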
Problem

Research questions and friction points this paper is trying to address.

Investigates the inconsistent relationship between LLMs' generation and judgment capabilities
Addresses LLMs' sensitivity to the responses being judged in evaluation frameworks
Proposes a self-reference strategy to align generation and judgment abilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-reference-guided evaluation strategy
Leveraging a model's own answers as references
Aligning generation and judgment abilities
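The self-reference idea listed above can be sketched as a two-step call: the model first answers the question itself, then judges a candidate with its own answer supplied as the reference. This is a hedged illustration only. The `LLM` callable interface, both function names, and the prompt wording are assumptions made for this sketch; the paper's actual prompts are not given in this summary.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to model text.
LLM = Callable[[str], str]


def build_self_reference_prompt(question: str, candidate: str, reference: str) -> str:
    """Assemble a judging prompt that embeds the judge's own answer as reference."""
    return (
        "You are grading a candidate answer.\n"
        f"Question: {question}\n"
        f"Reference answer (your own earlier attempt): {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def judge_with_self_reference(model: LLM, question: str, candidate: str) -> bool:
    """Return True if the model, guided by its own answer, accepts the candidate."""
    # Step 1: the judge model generates its own answer to the question.
    own_answer = model(f"Question: {question}\nAnswer concisely.")
    # Step 2: the same model judges the candidate against that answer.
    verdict = model(build_self_reference_prompt(question, candidate, own_answer))
    # "INCORRECT" does not start with "CORRECT", so this check is unambiguous.
    return verdict.strip().upper().startswith("CORRECT")
```

Because `model` is just a callable, the same sketch works with any client wrapper or with a stub function for testing.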
Wei-Hsiang Lin
Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Sheng-Lun Wei
Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Hen-Hsen Huang
Institute of Information Science, Academia Sinica, Taiwan
natural language processing, discourse analysis, information retrieval, Chinese processing
Hsin-Hsi Chen
Professor of Computer Science, National Taiwan University
Natural Language Processing, Information Retrieval, Information Extraction, Web Mining, Artificial Intelligence