🤖 AI Summary
Evaluation in relational learning typically relies on flat leaderboards that average over datasets and overlook their intrinsic geometric differences, which can lead to misleading judgments of model generalization. This work proposes a curvature-stratified, geometry-aware evaluation framework that groups 14 datasets into positive, negative, and near-zero curvature regimes and systematically assesses 18 models (including GCNs, graph foundation models, and tabular methods) across these regimes. Experiments show that model performance depends strongly on data geometry: rankings remain stable within each curvature regime but shift significantly across regimes. Notably, in some regimes graph foundation models offer diminishing returns compared to geometry-aligned GNNs. These findings challenge the universality assumption underlying conventional aggregate metrics and support a more fine-grained, structure-aware protocol for evaluating relational learning methods.
📝 Abstract
Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry-dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade-offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. Our benchmark evaluates 18 representative models, including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods, across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry-dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry-aligned GNNs. Based on these findings, we propose a geometry-aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature-stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are available on our project homepage: https://sirbabbage.github.io/CurvBench_HOME/.
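
The contrast between a flat leaderboard and a curvature-stratified one can be illustrated with a minimal Python sketch. It is not the released benchmark code. The abstract does not state which curvature estimator or thresholds are used, so the scalar curvature per dataset, the threshold `eps`, the dataset and model names, and every score below are hypothetical toy values chosen only to show the mechanics.

```python
from collections import defaultdict


def regime(curvature, eps=0.1):
    """Map a scalar curvature estimate to a regime.

    eps is a hypothetical threshold, not a value taken from the paper.
    """
    if curvature > eps:
        return "positive"
    if curvature < -eps:
        return "negative"
    return "near-zero"


def rank_models(scores, datasets):
    """Rank models by mean score over the given datasets, best first."""
    models = sorted({m for d in datasets for m in scores[d]})
    mean = {m: sum(scores[d][m] for d in datasets) / len(datasets) for m in models}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(models, key=lambda m: (-mean[m], m))


def stratified_rankings(curvatures, scores, eps=0.1):
    """Group datasets by curvature regime, then rank models within each group."""
    groups = defaultdict(list)
    for name, kappa in curvatures.items():
        groups[regime(kappa, eps)].append(name)
    return {r: rank_models(scores, ds) for r, ds in groups.items()}


# Toy inputs (assumed for illustration, not results from the paper).
curvatures = {"A": 0.4, "B": 0.3, "C": -0.5, "D": -0.6, "E": 0.02}
scores = {
    "A": {"model_x": 0.90, "model_y": 0.80, "model_z": 0.70},
    "B": {"model_x": 0.88, "model_y": 0.82, "model_z": 0.71},
    "C": {"model_x": 0.60, "model_y": 0.85, "model_z": 0.75},
    "D": {"model_x": 0.62, "model_y": 0.83, "model_z": 0.78},
    "E": {"model_x": 0.75, "model_y": 0.74, "model_z": 0.80},
}

# Flat leaderboard: a single ranking averaged over all datasets.
flat = rank_models(scores, list(curvatures))
# Stratified leaderboard: one ranking per curvature regime.
per_regime = stratified_rankings(curvatures, scores)

print("flat:", flat)
for name, ranking in per_regime.items():
    print(name, ranking)
```

With these toy numbers the flat ranking places `model_y` first, while the per-regime rankings put a different model first in each regime, which is the kind of trade-off the abstract says aggregated metrics can mask.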