🤖 AI Summary
This work addresses the challenge of fairly evaluating recommendation algorithms across diverse datasets, where performance depends on factors such as sparsity, sequential structure, and scale. Conventional evaluation that averages metrics across benchmarks can distort algorithm rankings and hinder fair comparison. To overcome this, the authors propose a data-driven ranking framework based on the Bradley-Terry (BT) model, extended with BT trees and BT models with covariates that incorporate dataset-specific statistics. The authors also propose a metric for ranking consistency and show how the resulting ranking depends on dataset characteristics. Notably, the ranking remains robust when some results are missing, and the framework can predict an algorithm's relative performance on unseen datasets without running the models.
📝 Abstract
Ranking recommendation algorithms is challenging because model performance is sensitive to dataset characteristics such as sparsity, sequential structure, and scale. This creates a demand for a principled methodology for fair comparison between algorithms. Naive aggregation of performance metrics (e.g., averaging NDCG over benchmarks) can yield misleading rankings, undermining practical model selection. To address this problem, we introduce a novel, data-driven ranking methodology based on the Bradley-Terry (BT) model. We demonstrate that the obtained ranking depends on key dataset statistics. Additionally, we propose a novel metric for evaluating ranking consistency and demonstrate the robustness of our ranking to incomplete data. Finally, we introduce a dataset-specific methodology for ranking algorithms on unseen datasets without running the models, relying on extensions of the Bradley-Terry framework, including BT trees and BT models with covariates.
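
To make the core idea concrete, here is a minimal sketch of a plain Bradley-Terry fit, in which the probability that algorithm i beats algorithm j is p_i / (p_i + p_j). It is not the authors' implementation: the function name `fit_bradley_terry`, the win-count matrix, and the three algorithm labels are hypothetical, and the fitting procedure shown is the classic minorization-maximization (Zermelo) iteration, chosen here only for illustration. The sketch assumes every algorithm wins at least one comparison and that the comparison graph is connected; it does not cover the BT trees or covariate extensions mentioned in the abstract.

```python
import numpy as np


def fit_bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of comparisons (e.g. datasets) in which
    algorithm i outperformed algorithm j. Returns strengths summing to 1.
    """
    wins = np.array(wins, dtype=float)   # copy so the input is not modified
    np.fill_diagonal(wins, 0.0)          # self-comparisons carry no information
    n = wins.shape[0]
    games = wins + wins.T                # comparisons played by each pair
    total_wins = wins.sum(axis=1)        # wins per algorithm
    p = np.full(n, 1.0 / n)              # uniform starting point

    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j games_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()             # fix the scale (strengths are relative)
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical win counts for three algorithms, 10 comparisons per pair.
algorithms = ["AlgoA", "AlgoB", "AlgoC"]
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = [algorithms[i] for i in np.argsort(-strengths)]

# Estimated probability that AlgoA beats AlgoB under the fitted model.
p_a_beats_b = strengths[0] / (strengths[0] + strengths[1])
print(ranking, round(float(p_a_beats_b), 3))
```

Because the fit uses only pairwise outcomes rather than averaged metric values, a pair that was never compared on some dataset simply contributes fewer counts to `wins`, which is one intuition for why a BT-style ranking can tolerate incomplete results.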