🤖 AI Summary
Current LLM evaluation relies predominantly on judge models that assign a single scalar score, which tells us which model performs better but not why it fails. To address this, we introduce CLEAR, an open-source, interactive framework for LLM error analysis. CLEAR uses LLMs to generate instance-level diagnostic feedback, clusters that feedback into system-level error issues, and quantifies the prevalence of each issue. An accompanying interactive dashboard enables aggregate visualizations, dynamic filtering by issue or score range, and drill-down to the individual instances behind each pattern. Crucially, this extends the judging paradigm from mere scoring to structured, interpretable error attribution—shifting evaluation from "which model performs better?" to "why and how does it fail?". Analyses on RAG and mathematical-reasoning benchmarks, together with a user case study, show that CLEAR surfaces salient error categories and their distributions, improves error-comprehension efficiency, and provides explainable, deployable support for model diagnosis and iterative improvement.
📝 Abstract
The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then derives a set of system-level error issues and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard for comprehensive error analysis: it offers aggregate visualizations, applies interactive filters to isolate specific issues or score ranges, and drills down to the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis on RAG and Math benchmarks, and showcase its utility through a user case study.
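The feedback-to-issues step described above can be sketched in miniature. The snippet below is a toy stand-in, not CLEAR's actual implementation: the feedback strings are hypothetical, and identical normalized strings stand in for the LLM-driven clustering that groups semantically similar feedback. It shows the shape of the output — a ranked list of system-level issues with their prevalence.

```python
from collections import Counter

# Hypothetical per-instance feedback, as a judge LLM might produce it.
feedback = [
    "Answer ignores the retrieved context",
    "Arithmetic slip in the final step",
    "Answer ignores the retrieved context",
    "Cites a passage that does not exist",
    "Arithmetic slip in the final step",
]

def cluster_issues(notes):
    """Toy stand-in for LLM-based clustering: group identical normalized
    feedback strings into issues and report each issue's prevalence."""
    counts = Counter(n.strip().lower() for n in notes)
    total = len(notes)
    # Issues sorted by frequency, prevalence expressed as a fraction.
    return [(issue, count / total) for issue, count in counts.most_common()]

for issue, prevalence in cluster_issues(feedback):
    print(f"{prevalence:.0%}  {issue}")
```

In the real pipeline, the grouping key would come from an LLM judging semantic similarity rather than exact string matching, but the aggregation and prevalence accounting are the same idea.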