Empirical Cumulative Distribution Function Clustering for LLM-based Agent System Analysis

📅 2026-02-17

🤖 AI Summary
This work addresses a critical limitation in current evaluation methods for large language model (LLM) agents, which rely on majority voting and fail to distinguish between configurations yielding similar overall accuracy but differing response quality distributions. To overcome this, the authors propose a fine-grained evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This study is the first to introduce ECDF into LLM agent system evaluation, further integrating ECDF-based distance metrics with k-medoids clustering to analyze response distributions. Experimental results demonstrate that the proposed approach effectively uncovers interpretable effects of key factors, such as temperature, role prompting, and question topic, on response quality distributions, substantially enhancing both the discriminative power and analytical insight of LLM agent evaluations.
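
As a rough illustration of the core idea, the sketch below computes cosine similarities against a reference embedding and evaluates the ECDF, F_n(t) = (1/n) * #{x_i <= t}, on a grid. The function names, the toy similarity scores, and the 0.8 "correct" threshold are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def cosine_similarities(responses, reference):
    """Cosine similarity between each response embedding (one per row)
    and a single reference embedding."""
    responses = np.asarray(responses, dtype=float)
    reference = np.asarray(reference, dtype=float)
    dots = responses @ reference
    norms = np.linalg.norm(responses, axis=1) * np.linalg.norm(reference)
    return dots / norms

def ecdf(samples, grid):
    """Evaluate the ECDF at each grid point: the fraction of samples <= t."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, grid, side="right") / samples.size

# Made-up similarity scores for two hypothetical agent configurations.
# In both, 3 of 5 responses exceed the (assumed) 0.8 threshold, so a
# majority-style summary would treat them alike.
sims_a = np.array([0.95, 0.90, 0.85, 0.20, 0.15])
sims_b = np.array([0.95, 0.90, 0.85, 0.70, 0.65])

grid = np.array([0.5, 0.8, 1.0])
ecdf_a = ecdf(sims_a, grid)  # configuration A has mass below 0.5
ecdf_b = ecdf(sims_b, grid)  # configuration B has none
```

In this toy case the two ECDFs agree at 0.8 and 1.0 but differ at 0.5, which is the kind of distributional difference a single aggregated answer would hide.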

📝 Abstract
Large language models (LLMs) are increasingly used as agents to solve complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple responses from LLM agents into a single final answer, often via majority voting, and compares it against reference answers. However, this process can obscure the quality and distributional characteristics of the original responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers. This enables a more nuanced assessment of response quality beyond exact match metrics. To analyze the response distributions across different agent configurations, we further introduce a clustering method for ECDFs using their distances and the $k$-medoids algorithm. Our experiments on a QA dataset demonstrate that ECDFs can distinguish between agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structures in the responses, offering insights into the impact of temperature, persona, and question topics.
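
The clustering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify which ECDF distance is used, so the L1 distance (area between two ECDFs on a uniform grid) is an assumption here, as are the deterministic farthest-point initialization, the helper names, and the toy similarity samples.

```python
import numpy as np

def ecdf_on_grid(samples, grid):
    """Fraction of samples <= t for each grid point t."""
    samples = np.sort(np.asarray(samples, dtype=float))
    return np.searchsorted(samples, grid, side="right") / samples.size

def ecdf_distance_matrix(sample_sets, grid):
    """Pairwise L1 distance between ECDFs, approximated on a uniform grid.
    (Assumed metric; a sup-norm distance would be another option.)"""
    curves = np.stack([ecdf_on_grid(s, grid) for s in sample_sets])
    step = grid[1] - grid[0]
    gaps = np.abs(curves[:, None, :] - curves[None, :, :])
    return gaps.sum(axis=2) * step

def k_medoids(dist, k, n_iter=100):
    """Simple alternating k-medoids on a precomputed distance matrix."""
    # Deterministic init: most central point first, then farthest points.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # Assign every item to its closest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # New medoid: member with the smallest total in-cluster distance.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy cosine-similarity samples for four hypothetical agent configurations.
sample_sets = [
    [0.10, 0.20, 0.30],  # low-similarity configuration
    [0.15, 0.20, 0.35],  # low-similarity configuration
    [0.80, 0.85, 0.90],  # high-similarity configuration
    [0.75, 0.85, 0.95],  # high-similarity configuration
]
grid = np.linspace(0.0, 1.0, 101)
dist = ecdf_distance_matrix(sample_sets, grid)
medoids, labels = k_medoids(dist, k=2)
```

On this toy input the two low-similarity configurations share one cluster label and the two high-similarity configurations share the other. Working from a precomputed distance matrix is what makes k-medoids a natural fit here: each cluster center is itself one of the observed ECDFs, so it can be inspected directly.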

Problem

Research questions and friction points this paper is trying to address.

LLM-based agent
response distribution
evaluation framework
empirical cumulative distribution function
quality assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

ECDF clustering
LLM agent evaluation
cosine similarity distribution
k-medoids
response quality analysis