Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation

📅 2026-05-08

🤖 AI Summary
This study addresses a construct validity problem in current large language model (LLM) evaluation: method-induced variation, such as prompt sensitivity, is often misread as a genuine difference in capability. The work proposes a generalized Multi-Trait Multi-Method (MTMM) framework that unifies nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, as geometric measurements in a shared latent coordinate space. That space factorizes model behavior into three orthogonal dimensions: Instability and Sensitivity, Position and Alignment, and Coverage and Expressiveness. By separating task-irrelevant perturbations from true capability spans, the framework offers a theoretically grounded, domain-agnostic taxonomy intended to support robust and empirically stable benchmark design.
📝 Abstract
The evaluation of Large Language Models (LLMs) faces a critical challenge in construct validity, where fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.
Problem

Research questions and friction points this paper is trying to address.

construct validity
LLM evaluation
method variance
latent capabilities
benchmark design
Innovation

Methods, ideas, or system contributions that make the work stand out.

MTMM framework
geometric manifold
latent coordinate space
construct validity
orthogonal dimensions
🔎 Similar Papers
Adib Sakhawat
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Tahsin Islam
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Takia Farhin
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Syed Rifat Raiyan
Systems and Software Lab (SSL), Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
Hasan Mahmud
Postdoctoral Research Associate, Rochester Institute of Technology
Information Systems, Algorithmic decision-making, HCI/Human-AI interaction
Md Kamrul Hasan
Department of Computer Science
Smart Health, Noninvasive Blood Test, Image processing