🤖 AI Summary
Addressing the challenge of reliably inferring an AI system's cognitive capabilities from heterogeneous, few-shot task performance, this paper proposes a Bayesian triangulation framework for cognitive profiling. The method introduces a "measurement layout": a generative model (implemented in PyMC) that jointly relates task-instance features, latent capability dimensions, and system responses, thereby avoiding traditional psychometrics' reliance on large, homogeneous populational datasets. Its key contribution is combining Bayesian latent-variable modeling with the triangulation of task-instance features, enabling individualized, architecture-agnostic capability inference from non-populational data. Evaluated on the AnimalAI Olympics (68 competing agents) and the O-PIAAGETS object permanence battery (30 synthetic agents), the framework reconstructs fine-grained cognitive profiles, improving the discriminability and interpretability of inferred capabilities. The results support capability-oriented evaluation as a principled alternative to conventional behavioral benchmarks.
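To make the idea concrete, here is a minimal sketch of a measurement layout in PyMC. It is not the paper's actual model: the data, priors, and the single `capability` dimension are illustrative assumptions. The sketch links one latent capability to binary success through a sigmoid of the gap between capability and each instance's demand.

```python
import numpy as np
import pymc as pm

# Hypothetical few-shot data for one agent: each task instance has a
# scalar demand (e.g., distance to a reward) and a binary outcome.
demand = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0])
success = np.array([1, 1, 1, 1, 0, 1, 0, 0])

with pm.Model() as layout:
    # Latent capability: the demand level the agent can reliably meet.
    capability = pm.Normal("capability", mu=2.5, sigma=2.0)
    # How sharply performance drops once demand exceeds capability.
    slope = pm.HalfNormal("slope", sigma=2.0)
    # Success becomes likely when capability exceeds the instance demand.
    p = pm.Deterministic("p", pm.math.sigmoid(slope * (capability - demand)))
    pm.Bernoulli("obs", p=p, observed=success)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# The posterior over `capability` is one entry of the cognitive profile.
print(idata.posterior["capability"].mean())
```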
📝 Abstract
As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data -- a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.
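Extending the sketch above under the same illustrative assumptions, triangulation across many heterogeneous instances can be expressed by giving each instance a vector of demands and each agent a latent capability per dimension. The noisy-AND combination used here (success requires every demanded capability to be met) is an assumption for the example, not necessarily the paper's layout.

```python
import numpy as np
import pymc as pm

# Hypothetical results table: one row per (agent, task instance), two
# demand dimensions (e.g., navigation, object permanence), binary outcome.
rng = np.random.default_rng(0)
n_agents, n_rows, n_dims = 5, 200, 2
agent = rng.integers(0, n_agents, size=n_rows)
demands = rng.uniform(0.0, 5.0, size=(n_rows, n_dims))
true_caps = rng.uniform(1.0, 4.5, size=(n_agents, n_dims))
p_true = 1.0 / (1.0 + np.exp(-2.0 * (true_caps[agent] - demands)))
success = rng.binomial(1, p_true.prod(axis=1))

with pm.Model() as profile_model:
    # One latent capability per agent and per dimension.
    caps = pm.Normal("caps", mu=2.5, sigma=2.0, shape=(n_agents, n_dims))
    slope = pm.HalfNormal("slope", sigma=2.0)
    # Per-dimension success probabilities, combined by a noisy-AND:
    # the instance succeeds only if every demanded capability is met.
    p_dim = pm.math.sigmoid(slope * (caps[agent] - demands))
    pm.Bernoulli("obs", p=p_dim.prod(axis=1), observed=success)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means give a per-agent, per-dimension cognitive profile.
print(idata.posterior["caps"].mean(dim=("chain", "draw")).values)
```

Because every instance constrains several capability dimensions at once, even a few dozen heterogeneous instances per agent can pin down a profile without the populational samples traditional psychometrics requires.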