Inferring Capabilities from Task Performance with Bayesian Triangulation

📅 2023-09-21
🏛️ arXiv.org
📈 Citations: 9
Influential: 0
🤖 AI Summary
Addressing the challenge of reliably inferring AI systems' cognitive capabilities from heterogeneous, few-shot task performance, this paper proposes a Bayesian triangulation framework for cognitive profiling. The method introduces a "measurement layout" generative model (implemented in PyMC) that jointly models task-instance features, latent capability dimensions, and system responses, thereby avoiding traditional psychometrics' reliance on large-scale, homogeneous datasets. Its key contribution is combining Bayesian latent-variable modeling with triangulation across diverse tasks, enabling individualized, architecture-agnostic capability inference. Evaluated on the AnimalAI Olympics benchmark (68 competing agents) and the O-PIAAGETS object-permanence battery (30 synthetic agents), the framework reconstructs fine-grained cognitive profiles, improving the discriminability and interpretability of inferred capabilities. The results support capability-oriented evaluation as a principled alternative to conventional behavioral benchmarks.
📝 Abstract
As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data -- a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.
Problem

Research questions and friction points this paper is trying to address.

Infer cognitive profiles from diverse experimental task performance data
Model task-feature and capability interactions affecting system performance
Enable capability inference from non-populational data using Bayesian methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian triangulation infers cognitive profiles from diverse task data
Measurement layouts model interactions between task features and capabilities
Probabilistic programming (PyMC) infers capabilities from non-populational data
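The paper builds its measurement layouts in PyMC, modelling how task-instance features interact with latent capabilities to produce observed success or failure. As a self-contained illustration of that idea, the sketch below simulates one agent under a toy measurement layout and recovers its latent capability by grid-based Bayesian inference with NumPy only. The demand values, logistic link, slope, and single capability dimension are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical task-instance demand features (e.g., maze complexity).
n_tasks = 200
demands = rng.uniform(-2.0, 2.0, size=n_tasks)
true_capability = 0.8  # latent capability of the simulated agent

# Generative "measurement layout": success is more likely when
# the agent's capability exceeds the instance's demand.
slope = 3.0  # assumed discrimination parameter
p_success = sigmoid(slope * (true_capability - demands))
responses = rng.random(n_tasks) < p_success  # observed binary outcomes

# Invert the layout: posterior over capability on a grid,
# with a standard-normal prior (the paper uses MCMC via PyMC instead).
grid = np.linspace(-3.0, 3.0, 601)
log_prior = -0.5 * grid**2
pg = sigmoid(slope * (grid[:, None] - demands[None, :]))
log_lik = np.where(responses, np.log(pg), np.log1p(-pg)).sum(axis=1)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()

cap_mean = float((grid * post).sum())
print(f"posterior mean capability: {cap_mean:.2f}")
```

With many instances spanning a range of demands, the posterior concentrates near the true capability; triangulation across heterogeneous instances is what makes the latent dimension identifiable without population-level data.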