MultiwayPAM: Multiway Partitioning Around Medoids for LLM-as-a-Judge Score Analysis

📅 2026-03-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems with using large language models as judges (LLM-as-a-Judge) for large-scale text evaluation: the high computational cost of LLM inference and the inherent scoring bias of an LLM evaluator. The authors propose MultiwayPAM, a tensor clustering algorithm that simultaneously estimates the cluster memberships and the medoids for each mode of a third-order score tensor indexed by questions, answerers, and evaluators. Built on a multiway extension of partitioning around medoids, the method exposes cluster structure in every mode of the tensor, and inspecting the medoids shows what characterizes each question, answerer, and evaluator cluster, including patterns of evaluator bias. Experiments on score tensors from two practical datasets demonstrate the effectiveness of MultiwayPAM.

📝 Abstract

LLM-as-a-Judge is a flexible framework for text evaluation that allows us to score the quality of a given text from various perspectives by changing the prompt template. Two main challenges in using LLM-as-a-Judge are the computational cost of LLM inference, especially when evaluating a large number of texts, and the inherent bias of an LLM evaluator. To address these issues and reveal the structure of the score bias caused by an LLM evaluator, we propose applying a tensor clustering method to a given LLM-as-a-Judge score tensor, whose entries are the scores for different combinations of questions, answerers, and evaluators. Specifically, we develop a new tensor clustering method, MultiwayPAM, with which we can simultaneously estimate the cluster membership and the medoids for each mode of a given data tensor. By observing the medoids obtained by MultiwayPAM, we can gain knowledge about the membership of each question/answerer/evaluator cluster. We experimentally show the effectiveness of MultiwayPAM by applying it to the score tensors for two practical datasets.
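
To make the score-tensor setup concrete, here is a minimal Python sketch. It is not the authors' MultiwayPAM algorithm: the abstract describes estimating memberships and medoids for all modes simultaneously, whereas this sketch runs an ordinary alternating k-medoids on each mode's unfolding independently. The toy data, the function names (`make_toy_scores`, `unfold`, `k_medoids`, `cluster_each_mode`), the squared Euclidean distance, and the farthest-point initialization are all assumptions made for illustration.

```python
import numpy as np


def make_toy_scores(seed=0):
    """Hypothetical 6 x 4 x 4 score tensor (questions x answerers x evaluators)."""
    rng = np.random.default_rng(seed)
    q_group = np.array([0, 0, 0, 1, 1, 1])   # planted question clusters
    a_group = np.array([0, 0, 1, 1])         # planted answerer clusters
    e_group = np.array([0, 0, 1, 1])         # planted evaluator clusters
    base = np.array([[3.0, 6.0], [5.0, 8.0]])  # question cluster x answerer cluster
    bias = np.array([0.0, 2.0])                # second evaluator cluster is lenient
    scores = base[q_group][:, a_group][:, :, None] + bias[e_group][None, None, :]
    return scores + 0.1 * rng.standard_normal(scores.shape)


def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def k_medoids(rows, k, n_iter=50):
    """Plain alternating k-medoids on matrix rows (assumed stand-in, not MultiwayPAM)."""
    # Pairwise squared Euclidean distances between rows.
    dist = ((rows[:, None, :] - rows[None, :, :]) ** 2).sum(axis=-1)
    # Deterministic farthest-point initialization (an assumption of this sketch).
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        medoids.append(int(dist[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member with the smallest total distance to its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids


def cluster_each_mode(scores, ks):
    """Cluster questions, answerers, and evaluators independently, one mode at a time."""
    return [k_medoids(unfold(scores, mode), k) for mode, k in enumerate(ks)]


if __name__ == "__main__":
    scores = make_toy_scores()
    names = ("questions", "answerers", "evaluators")
    for name, (labels, medoids) in zip(names, cluster_each_mode(scores, ks=(2, 2, 2))):
        print(name, "labels:", labels, "medoid indices:", medoids)
```

Each medoid is an actual question, answerer, or evaluator from the data rather than an averaged centroid, so a cluster can be interpreted by looking at its representative member. That is the property the abstract relies on when it says that observing the medoids gives knowledge about each cluster.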

Problem

Research questions and friction points this paper is trying to address.

LLM-as-a-Judge

computational cost

bias

score tensor

text evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

MultiwayPAM

tensor clustering

LLM-as-a-Judge