🤖 AI Summary
Problem: Existing LLM-KG evaluation benchmarks overemphasize answer accuracy while neglecting systematic characterization of task-level cognitive complexity, which leads to fragmented capability assessment, undetected blind spots, and insufficient task diversity. Method: This work introduces three established cognitive complexity frameworks from cognitive psychology into LLM-KG benchmark analysis for the first time, enabling multidimensional complexity modeling of Knowledge Graph (KG) tasks in LLM-KG-Bench. Contribution/Results: We uncover severe imbalances in the distribution of cognitive demands across current evaluations, in particular the underrepresentation of higher-order reasoning and multi-step planning tasks. Our findings provide empirical grounding and a principled basis for designing more interpretable, balanced, and challenging KG evaluation tasks. This advances KG evaluation from a purely outcome-oriented paradigm toward a dual-dimensional framework that jointly assesses process-level reasoning and underlying cognitive capabilities.
📝 Abstract
Large Language Models (LLMs) are increasingly used for tasks involving Knowledge Graphs (KGs), and their evaluation typically focuses on accuracy and output correctness. We propose a complementary task characterization approach using three complexity frameworks from cognitive psychology. Applying this to the LLM-KG-Bench framework, we characterize the distribution of complexity values across its tasks, identify underrepresented cognitive demands, and motivate richer interpretation and greater diversity in benchmark evaluation tasks.
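To make the task-characterization idea concrete, the minimal sketch below annotates a few hypothetical tasks along three illustrative complexity dimensions (a Bloom-style knowledge level, a reasoning-step count, and an element-interactivity rating) and computes per-dimension value distributions. The dimension names, task names, and ratings are assumptions for illustration only, not the frameworks, annotation scheme, or task set used in the paper.

```python
# Illustrative sketch only: hypothetical tasks, dimensions, and rating scales,
# not the actual LLM-KG-Bench task set or the paper's annotation scheme.
from collections import Counter
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """A benchmark task annotated along several cognitive-complexity dimensions."""
    name: str                   # hypothetical task identifier
    knowledge_level: str        # e.g. a Bloom-style level from "remember" to "create"
    reasoning_steps: int        # rough count of required inference/planning steps
    element_interactivity: str  # e.g. "low" / "medium" / "high"


# Hypothetical annotations for a handful of tasks (placeholder values).
tasks = [
    TaskProfile("turtle_syntax_fix", "apply", 1, "low"),
    TaskProfile("sparql_fact_lookup", "understand", 1, "low"),
    TaskProfile("sparql_aggregation", "apply", 2, "medium"),
    TaskProfile("ontology_extension", "create", 4, "high"),
]


def distribution(values):
    """Return the relative frequency of each observed value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


if __name__ == "__main__":
    # Per-dimension distributions show which cognitive demands dominate a
    # benchmark and which (e.g. higher-order, multi-step tasks) are underrepresented.
    print("knowledge level:", distribution(t.knowledge_level for t in tasks))
    print("reasoning steps:", distribution(t.reasoning_steps for t in tasks))
    print("interactivity:  ", distribution(t.element_interactivity for t in tasks))
```

Under this kind of annotation, a skew such as most tasks falling into low reasoning-step counts would surface directly in the printed distributions, which is the sort of imbalance the analysis aims to expose.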