🤖 AI Summary
Jointly evaluating fairness and relevance in recommender systems has long been unreliable: separate per-aspect assessments yield two different "best" models, and ad-hoc joint metrics correlate poorly with classical measures (e.g., NDCG).
Method: We propose Distance to Pareto Frontier (DPFR), the first recommendation evaluation metric to incorporate Pareto optimality: it constructs a two-dimensional Pareto frontier for a pair of fairness and relevance metrics (e.g., statistical parity and NDCG), then quantifies joint performance as the Euclidean distance to this frontier. DPFR is modular and intuitive, as it can be computed with existing standard metrics.
Results: Experiments across four models, three re-ranking strategies, and six datasets show that existing joint metrics have inconsistent associations with the Pareto-optimal solution, with most deviating substantially from the frontier, making DPFR a more robust and theoretically well-founded joint measure of fairness and relevance.
📝 Abstract
Fairness and relevance are two important aspects of recommender systems (RSs). Typically, they are evaluated either (i) separately by individual measures of fairness and relevance, or (ii) jointly using a single measure that accounts for fairness with respect to relevance. However, approach (i) often does not provide a reliable joint estimate of the goodness of the models, as it has two different best models: one for fairness and another for relevance. Approach (ii) is also problematic because these measures tend to be ad-hoc and do not relate well to traditional relevance measures, like NDCG. Motivated by this, we present a new approach for jointly evaluating fairness and relevance in RSs: Distance to Pareto Frontier (DPFR). Given some user-item interaction data, we compute their Pareto frontier for a pair of existing relevance and fairness measures, and then use the distance from the frontier as a measure of the jointly achievable fairness and relevance. Our approach is modular and intuitive as it can be computed with existing measures. Experiments with 4 RS models, 3 re-ranking strategies, and 6 datasets show that existing metrics have inconsistent associations with our Pareto-optimal solution, making DPFR a more robust and theoretically well-founded joint measure for assessing fairness and relevance. Our code: https://github.com/theresiavr/DPFR-recsys-evaluation