π€ AI Summary
Facial recognition models suffer from data bias due to the discretization of ethnic labels, which obscures the inherent continuity of ethnic distributions and leads to the fundamental flaw that βequal identity counts β balanced data.β This work pioneers a continuous ethnicity modeling paradigm, replacing discrete labels with continuous ethnic embeddings. We propose a novel training framework integrating spectral-distance-driven dynamic reweighting sampling and multi-scale ethnic distribution alignment. Evaluated across 65+ models and 20+ data subsets, our method achieves an average 12.7% improvement in cross-ethnic recognition accuracy and significantly reduces false positive and false negative rate disparities. It establishes a new paradigm for data balance in continuous ethnic space and provides both theoretically grounded and empirically scalable foundations for fairness-aware modeling in facial recognition.
π Abstract
Bias has been a constant in face recognition models. Over the years, researchers have looked at it from both the model and the data point of view. However, their approach to mitigation of data bias was limited and lacked insight on the real nature of the problem. Here, in this document, we propose to revise our use of ethnicity labels as a continuous variable instead of a discrete value per identity. We validate our formulation both experimentally and theoretically, showcasing that not all identities from one ethnicity contribute equally to the balance of the dataset; thus, having the same number of identities per ethnicity does not represent a balanced dataset. We further show that models trained on datasets balanced in the continuous space consistently outperform models trained on data balanced in the discrete space. We trained more than 65 different models, and created more than 20 subsets of the original datasets.