🤖 AI Summary
Current LLM text quality evaluation faces three challenges: simultaneously accommodating multiple dimensions (e.g., coherence, diversity, fluency), reconciling automatic metrics with human judgments, and providing statistical inference guarantees. To address these, we propose GSD-front, a framework based on generalized stochastic dominance for joint multidimensional statistical assessment under non-i.i.d. sampling. GSD-front unifies ordinal human ratings and cardinal automatic metrics within a single, weight-free model. By leveraging partial-order analysis, it enables robust, statistically grounded comparisons among decoding strategies, revealing significant multidimensional quality differences relative to human-written references. Experiments demonstrate that GSD-front delivers interpretable, reproducible, and statistically significant multi-criteria evaluation, overcoming the limitations of single-metric reliance and opaque aggregated scoring schemes.
📝 Abstract
Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations of existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, by building on partial orders over decoding strategies and thereby avoiding arbitrary weighting of the metrics involved. By applying this framework to evaluate common decoding strategies against human-generated text, we demonstrate its ability to identify statistically significant performance differences while accounting for potential deviations from the i.i.d. assumption of the sampling design.
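The full GSD-front framework involves generalized stochastic dominance with inferential tests under relaxed sampling assumptions; as a simplified, purely empirical intuition, the sketch below checks plain first-order stochastic dominance per metric and then declares one decoding strategy weakly dominant only if it dominates on every metric, a Pareto-style check. Function names and the deterministic comparison are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def fosd(a, b):
    """Empirical first-order stochastic dominance of a over b (higher = better):
    the empirical CDF of a must lie at or below that of b at every score level,
    i.e. a places at least as much probability mass on high scores everywhere."""
    grid = np.union1d(a, b)                     # evaluate both CDFs on a common grid
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return bool(np.all(cdf_a <= cdf_b))

def weakly_dominates(scores_a, scores_b):
    """scores_*: dict mapping metric name -> 1-D array of per-sample scores.
    Strategy A weakly dominates B only if it dominates on every metric,
    so no arbitrary weighting across metrics is ever introduced."""
    return all(fosd(scores_a[m], scores_b[m]) for m in scores_a)

# Hypothetical per-sample scores for two decoding strategies:
nucleus = {"coherence": np.array([0.8, 0.9, 0.7]), "fluency": np.array([0.7, 0.8, 0.9])}
greedy  = {"coherence": np.array([0.4, 0.5, 0.3]), "fluency": np.array([0.3, 0.4, 0.2])}
print(weakly_dominates(nucleus, greedy))  # → True
```

When neither strategy dominates the other on all metrics, the two are incomparable under the partial order, which is exactly the situation the statistical GSD-front machinery is designed to adjudicate.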