Statistical Multicriteria Evaluation of LLM-Generated Text

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluation of LLM text quality struggles to accommodate multiple dimensions simultaneously (e.g., coherence, diversity, fluency), to reconcile automatic metrics with human judgments, and to provide statistical inference guarantees. To address these limitations, we propose GSD-front, a framework based on generalized stochastic dominance for joint multidimensional statistical assessment that accounts for deviations from the i.i.d. assumption. GSD-front unifies ordinal human ratings and cardinal automatic metrics within a single, weight-free modeling framework. By leveraging partial order analysis, it enables robust, statistically grounded comparisons among decoding strategies, revealing significant multidimensional quality differences relative to human-written references. Experiments demonstrate that GSD-front delivers interpretable, reproducible, and statistically significant multi-criteria evaluation, overcoming the limitations of single-metric reliance and opaque, aggregated scoring schemes.

📝 Abstract
Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding strategies, thus avoiding arbitrary weighting of the involved metrics. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
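The core idea of comparing decoding strategies via a partial order, rather than a weighted score, can be illustrated with a simplified sketch. The snippet below checks first-order stochastic dominance per metric and declares one strategy above another only if it dominates on every quality dimension at once; this is a didactic simplification, not the paper's GSD-front (which additionally handles mixed ordinal/cardinal scales and statistical inference), and all names (`stochastically_dominates`, `gsd_partial_order`, the strategy labels) are hypothetical.

```python
import numpy as np

def stochastically_dominates(a, b):
    """First-order stochastic dominance: the empirical CDF of `a` lies
    at or below that of `b` at every evaluation point (higher is better)."""
    grid = np.union1d(a, b)
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return bool(np.all(cdf_a <= cdf_b))

def gsd_partial_order(scores):
    """Build a partial order over decoding strategies: s is placed above t
    only if s dominates t on every metric simultaneously; otherwise the
    pair stays incomparable (no arbitrary metric weighting is needed).
    `scores` maps strategy name -> array of shape (n_samples, n_metrics)."""
    edges = []
    for s in scores:
        for t in scores:
            if s != t and all(
                stochastically_dominates(scores[s][:, m], scores[t][:, m])
                for m in range(scores[s].shape[1])
            ):
                edges.append((s, t))
    return edges
```

Because dominance must hold on all dimensions jointly, many strategy pairs remain incomparable, which is exactly the kind of partial-order structure the abstract describes.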
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM text quality with multiple nuanced criteria
Addressing single-metric limitations in current evaluation methods
Ensuring statistical validity in multi-dimensional text assessments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Generalized Stochastic Dominance framework
Evaluates multiple text quality dimensions simultaneously
Avoids arbitrary weighting of metrics
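To give a flavor of how a dominance claim can be accompanied by a significance statement, here is a minimal permutation-test sketch on a single metric. This is an assumed, simplified stand-in for illustration only: the paper's actual inference procedure is different (it provides guarantees under deviations from i.i.d. sampling), and the function name `dominance_pvalue` is hypothetical.

```python
import numpy as np

def dominance_pvalue(a, b, n_perm=2000, seed=0):
    """Permutation test for "a scores higher than b" on one metric:
    how often does a random relabeling of the pooled samples produce
    a mean gap at least as large as the observed one?"""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if perm[: len(a)].mean() - perm[len(a):].mean() >= observed:
            count += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (count + 1) / (n_perm + 1)
```

A small p-value lends statistical support to a single-metric dominance edge; a multidimensional procedure such as GSD-front would need to control for testing across all metrics jointly.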