🤖 AI Summary
This work addresses the challenge of reference-free text quality assessment. It systematically analyzes geometric properties of intermediate activation representations (such as intrinsic dimensionality, effective rank, and Schatten norms) across multiple layers of large language models (LLMs). Empirical analysis reveals strong correlations between these geometric measures and both textual naturalness and human-perceived quality, consistently across diverse LLMs and layers. Crucially, the study identifies intrinsic dimensionality and effective rank as universal, robust quality indicators, enabling zero-shot evaluation that requires neither human annotations nor reference texts. Experiments demonstrate that the proposed method delivers stable and reliable assessments across heterogeneous corpora of generated text; furthermore, different LLMs show high consensus when ranking text quality with these metrics. The approach significantly outperforms existing reference-free metrics (e.g., MAUVE) and establishes a novel paradigm for interpreting the mapping between internal model representations and textual quality.
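
To make the measures named above concrete, here is a minimal sketch (not the authors' code) of how effective rank and a Schatten norm can be computed from a single layer's activation matrix; the function names, the NumPy implementation, and the toy random matrix are illustrative assumptions.

```python
# Minimal sketch: spectral measures of one layer's activation matrix.
import numpy as np

def effective_rank(activations: np.ndarray) -> float:
    """Roy-Vetterli effective rank: entropy of the normalized singular-value spectrum."""
    # activations: (num_tokens, hidden_dim) hidden states of one text at one layer
    s = np.linalg.svd(activations, compute_uv=False)
    p = s / s.sum()                      # normalize singular values to a distribution
    p = p[p > 0]                         # guard against log(0)
    return float(np.exp(-(p * np.log(p)).sum()))

def schatten_norm(activations: np.ndarray, p: float = 1.0) -> float:
    """Schatten p-norm: the p-norm of the singular-value vector (p=1 is the nuclear norm)."""
    s = np.linalg.svd(activations, compute_uv=False)
    return float((s ** p).sum() ** (1.0 / p))

# Toy usage: a random matrix standing in for one layer's hidden states.
acts = np.random.randn(128, 768)
print(effective_rank(acts), schatten_norm(acts, p=1.0))
```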
📝 Abstract
This paper bridges internal and external analysis approaches to large language models (LLMs) by demonstrating that geometric properties of internal model representations serve as reliable proxies for evaluating generated text quality. We validate a set of metrics, including Maximum Explainable Variance, Effective Rank, Intrinsic Dimensionality, MAUVE score, and Schatten Norms, measured across different layers of LLMs, and show that Intrinsic Dimensionality and Effective Rank can serve as universal assessments of text naturalness and quality. Our key finding is that different models consistently rank text from various sources in the same order based on these geometric properties, indicating that the metrics reflect inherent text characteristics rather than model-specific artifacts. This enables reference-free text quality evaluation that does not require human-annotated datasets, offering practical advantages for automated evaluation pipelines.
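
As an illustration of what such a reference-free pipeline might look like, the sketch below extracts per-layer hidden states with Hugging Face transformers and scores two short texts by their mean effective rank. The choice of gpt2, the mean-over-layers aggregation, and the example sentences are assumptions for illustration, not the paper's protocol.

```python
# Hedged end-to-end sketch: per-layer effective rank of a text's hidden states.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

def layer_effective_ranks(text: str, model, tokenizer) -> list[float]:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    ranks = []
    for h in out.hidden_states:              # one (1, seq_len, hidden_dim) tensor per layer
        acts = h.squeeze(0).numpy()
        s = np.linalg.svd(acts, compute_uv=False)
        p = s / s.sum()
        p = p[p > 0]
        ranks.append(float(np.exp(-(p * np.log(p)).sum())))
    return ranks

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

natural = "The committee met on Tuesday to review the budget proposal."
shuffled = "budget the Tuesday review met to committee on The proposal."

# Reference-free comparison of two text sources by a single geometric measure.
print(np.mean(layer_effective_ranks(natural, model, tokenizer)))
print(np.mean(layer_effective_ranks(shuffled, model, tokenizer)))
```

Which layer is most informative and which direction of the metric corresponds to higher quality are empirical questions the paper addresses; the sketch only shows where the numbers come from.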