Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

Date: 2026-04-04
Citations: 0
Influential: 0
AI Summary
This study addresses the inconsistency and opacity of conventional large language model (LLM) evaluation methods that rely on direct scoring. To overcome these limitations, the authors propose a confidence-aware Fuzzy Analytic Hierarchy Process (FAHP), which models epistemic uncertainty using triangular fuzzy numbers. Inspired by dual-process theory, they further introduce DualJudge, a framework that adaptively integrates intuitive and deliberative evaluation pathways. A key innovation is the incorporation of LLM-generated confidence scores into FAHP, combined with consistency-aware weighting and a multi-criteria decision-making mechanism. Experiments on JudgeBench show that FAHP consistently outperforms direct scoring and that DualJudge achieves state-of-the-art evaluation performance, with more stable and better-calibrated judgments.
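To make the idea of confidence-modulated triangular fuzzy numbers concrete, here is a minimal Python sketch. It is not the authors' implementation: the linear mapping from confidence to spread, the `max_spread` parameter, and all function names are assumptions made for illustration. Weights are derived with Buckley's geometric-mean method and centroid defuzzification, which is one common FAHP variant and may differ from what the paper uses.

```python
from math import prod


def confidence_tfn(judgment, confidence, max_spread=2.0):
    """Map a crisp Saaty-scale judgment to a triangular fuzzy number (l, m, u).

    Assumption: the spread shrinks linearly as confidence goes from 0 to 1.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    spread = (1.0 - confidence) * max_spread
    lower = max(1.0 / 9.0, judgment - spread)  # stay inside the 1/9..9 scale
    upper = min(9.0, judgment + spread)
    return (lower, judgment, upper)


def reciprocal(tfn):
    """Reciprocal of a triangular fuzzy number, used for the mirrored cell."""
    lower, mid, upper = tfn
    return (1.0 / upper, 1.0 / mid, 1.0 / lower)


def fuzzy_weights(matrix):
    """Crisp criterion weights from a matrix of triangular fuzzy numbers."""
    n = len(matrix)
    # Component-wise geometric mean of each row.
    geo = [
        tuple(prod(cell[k] for cell in row) ** (1.0 / n) for k in range(3))
        for row in matrix
    ]
    total = [sum(g[k] for g in geo) for k in range(3)]
    # Fuzzy division: the lower bound divides by the upper total and vice versa.
    fuzzy = [(g[0] / total[2], g[1] / total[1], g[2] / total[0]) for g in geo]
    # Centroid defuzzification, then renormalise so the weights sum to 1.
    crisp = [sum(w) / 3.0 for w in fuzzy]
    norm = sum(crisp)
    return [c / norm for c in crisp]


one = (1.0, 1.0, 1.0)
a_vs_b = confidence_tfn(3, confidence=0.5)  # "A moderately preferred", unsure
matrix = [[one, a_vs_b], [reciprocal(a_vs_b), one]]
weights = fuzzy_weights(matrix)
```

With full confidence the fuzzy number collapses to a crisp value and the procedure reduces to ordinary geometric-mean AHP; lower confidence widens the interval, which is the behaviour the summary describes.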
Abstract
Effective evaluation of large language models (LLMs) remains a critical bottleneck, as conventional direct scoring often yields inconsistent and opaque judgments. In this work, we adapt the Analytic Hierarchy Process (AHP) to LLM-based evaluation and, more importantly, propose a confidence-aware Fuzzy AHP (FAHP) extension that models epistemic uncertainty via triangular fuzzy numbers modulated by LLM-generated confidence scores. Systematically validated on JudgeBench, our structured approach decomposes assessments into explicit criteria and incorporates uncertainty-aware aggregation, producing more calibrated judgments. Extensive experiments demonstrate that both crisp and fuzzy AHP consistently outperform direct scoring across model scales and dataset splits, with FAHP showing superior stability in uncertain comparison scenarios. Building on these insights, we propose DualJudge, a hybrid framework inspired by Dual-Process Theory that adaptively fuses holistic direct scores with structured AHP outputs via consistency-aware weighting. DualJudge achieves state-of-the-art performance, underscoring the complementary strengths of intuitive and deliberative evaluation paradigms. These results establish uncertainty-aware structured reasoning as a principled pathway toward more reliable LLM assessment. Code is available at https://github.com/hreyulog/AHP_llm_judge.
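The crisp AHP step and the consistency-aware fusion mentioned in the abstract can be sketched as follows. This is an illustrative reading, not code from the linked repository: the weights come from the standard principal-eigenvector method with Saaty's consistency ratio (CR), while `fuse_scores`, its linear trust rule, and the `cr_limit` and `ahp_share` parameters are hypothetical stand-ins for whatever weighting DualJudge actually uses.

```python
# Saaty's random consistency index for matrices of size 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix, iterations=100):
    """Return (weights, consistency_ratio) for a pairwise comparison matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    # Power iteration approximates the principal eigenvector.
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        w = [v / total for v in nxt]
    # Estimate the principal eigenvalue from A w = lambda w.
    aw = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return w, cr


def fuse_scores(direct_score, ahp_score, cr, cr_limit=0.1, ahp_share=0.5):
    """Hypothetical fusion rule: trust in the AHP score decays linearly
    with CR and reaches zero at cr_limit (Saaty's usual 0.1 threshold)."""
    trust = min(1.0, max(0.0, 1.0 - cr / cr_limit))
    alpha = ahp_share * trust
    return alpha * ahp_score + (1.0 - alpha) * direct_score


# A perfectly consistent 3-criterion matrix (ratios 4 : 2 : 1).
pairwise = [[1.0, 2.0, 4.0],
            [0.5, 1.0, 2.0],
            [0.25, 0.5, 1.0]]
weights, cr = ahp_weights(pairwise)
fused = fuse_scores(direct_score=0.2, ahp_score=0.8, cr=cr)
```

In this sketch an inconsistent comparison matrix (CR at or above `cr_limit`) falls back entirely to the direct score, while a consistent one blends both pathways; other monotone trust functions would serve equally well.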
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Model Evaluation
Uncertainty
Multi-Criteria Decision Making
Judgment Consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuzzy Analytic Hierarchy Process
Large Language Model Evaluation
Epistemic Uncertainty
DualJudge
Structured Multi-Criteria Assessment
Yulong He
St Petersburg University
Ivan Smirnov
ITMO University, Kronverkskiy av., 49, St. Petersburg, Russia
Dmitry Fedrushkov
ITMO University, Kronverkskiy av., 49, St. Petersburg, Russia
Sergey Kovalchuk
ITMO University
artificial intelligence, human-AI interaction, complex systems, computational science
Ilya Revin
ITMO University, Kronverkskiy av., 49, St. Petersburg, Russia