🤖 AI Summary
This study addresses a pervasive construct validity problem in current large language model (LLM) evaluation: method-induced variation, such as prompt sensitivity, is often misinterpreted as a genuine difference in capability. The work proposes a generalized Multi-Trait Multi-Method (MTMM) framework that geometrically unifies nine evaluation metrics in a shared latent coordinate space with three orthogonal dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. Combining systematic literature synthesis, MTMM validation, and manifold-based geometric modeling, the framework treats metrics such as Paraphrase Instability and Drift Score as geometric measurements rather than isolated scalars, separating task-irrelevant perturbations from true model capabilities. The resulting taxonomy supports fine-grained decomposition of model behavior and gives a theoretically grounded, domain-agnostic basis for robust and empirically stable benchmark design.
📝 Abstract
The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.
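The separation of method variance from trait variance described above can be illustrated with the classic MTMM correlation check. The sketch below is not the paper's formulation: the function name `mtmm_summary`, the array layout, and the synthetic scores are all assumptions made for illustration. It compares the mean correlation of the same trait measured by different methods (convergent validity) with the mean correlation of different traits measured by the same method (a sign of shared method variance).

```python
import numpy as np


def mtmm_summary(scores):
    """Summarize an MTMM correlation matrix.

    scores: array of shape (n_models, n_traits, n_methods); hypothetical layout.
    Returns (mean same-trait/different-method correlation,
             mean different-trait/same-method correlation).
    """
    n_models, n_traits, n_methods = scores.shape
    # One column per (trait, method) pair; column index = trait * n_methods + method.
    flat = scores.reshape(n_models, n_traits * n_methods)
    corr = np.corrcoef(flat, rowvar=False)

    same_trait, same_method = [], []
    n_cols = n_traits * n_methods
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            trait_i, method_i = divmod(i, n_methods)
            trait_j, method_j = divmod(j, n_methods)
            if trait_i == trait_j and method_i != method_j:
                same_trait.append(corr[i, j])
            elif trait_i != trait_j and method_i == method_j:
                same_method.append(corr[i, j])
    return float(np.mean(same_trait)), float(np.mean(same_method))


# Synthetic data (assumption): 200 models, 2 independent traits, 3 methods
# (for example, 3 prompt templates) that each add a little noise.
rng = np.random.default_rng(0)
traits = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 2, 3))
scores = traits[:, :, None] + noise

convergent, method_shared = mtmm_summary(scores)
print(f"same trait, different method: {convergent:.2f}")
print(f"different trait, same method: {method_shared:.2f}")
```

In this synthetic setup the first number should be high and the second close to zero, which is the pattern expected when a benchmark measures traits rather than methods. A high second number would suggest that the prompt format, not the capability, is driving the scores.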