Leveraging Data Symmetries to Select an Optimal Subset of Training Data under Label Noise

📅 2026-05-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of selecting a high-quality subset from training data with noisy labels, so that a model trained on the subset approaches the performance of noise-free training. The authors observe that conventional k-nearest neighbors (k-NN) degrades in high-dimensional, label-noisy settings and propose integrating symmetry and invariance priors into subset selection. Specifically, they introduce symmetry into the cutstats framework for the first time and show theoretically that exploiting invariance brings k-NN asymptotically closer to the Bayes optimal classifier. They also show that, when the symmetries are only partially known, learned symmetry-aware representations can still model them effectively. Empirical results confirm that the proposed method substantially improves subset selection quality under high-dimensional label noise, yielding downstream model performance close to that attainable with clean labels.
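To illustrate how a symmetry prior can change k-NN neighborhoods, here is a minimal sketch that is not taken from the paper. It assumes, purely for illustration, that the known symmetry is the finite group of cyclic shifts, and it compares plain Euclidean k-NN with k-NN under an orbit-minimum distance. The names `cyclic_shifts`, `invariant_distance`, and `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def cyclic_shifts(x):
    """Orbit of x under the assumed symmetry group (all cyclic shifts)."""
    x = list(x)
    return [tuple(x[i:] + x[:i]) for i in range(len(x))]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def invariant_distance(x, y):
    """Orbit-minimum distance: smallest distance from x to any shift of y."""
    return min(euclidean(x, gy) for gy in cyclic_shifts(y))

def knn_predict(query, points, labels, k, dist):
    """Majority vote over the k training points closest to query under dist."""
    order = sorted(range(len(points)), key=lambda i: dist(query, points[i]))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (assumed): class "a" is a single peak, class "b" a double peak.
points = [(3, 0, 0, 0), (0, 2, 2, 0)]
labels = ["a", "b"]
query = (0, 0, 3, 0)  # a shifted single peak, so it belongs to class "a"

plain = knn_predict(query, points, labels, 1, euclidean)
symmetric = knn_predict(query, points, labels, 1, invariant_distance)
print(plain, symmetric)  # b a
```

On this toy input the plain distance matches the query to the wrong class because the peak sits at a different position, while the orbit-minimum distance treats all shifts as equivalent. Enumerating the orbit is only practical for small finite groups; the summary's point about learned symmetry-aware representations concerns the case where this enumeration is not available.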
📝 Abstract
The performance of machine learning models often relies on large labeled datasets; however, data collected from diverse sources can contain label noise. Recent work has shown that, in noisy settings, there may exist a subset of the training data on which models can achieve performance comparable to training on a noise-free dataset. A widely used method for identifying such subsets is cutstats, which employs k-nearest neighbors (k-NN) to detect low-noise samples. However, its performance on high-dimensional data remains largely unexplored. In this work, we formally establish that the performance of a classifier trained on a subset of a noisy dataset selected via cutstats is influenced by the accuracy of k-NN. We further demonstrate that, in noisy environments, exploiting data invariance and knowledge of underlying symmetries can significantly enhance the performance of k-NN, bringing it closer to the Bayes optimal classifier even in high-dimensional regimes. Finally, we show that for real-world scenarios, where information about the underlying invariance is only partially known, learnt invariant representations can still facilitate the identification of near-optimal subsets.
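The idea of using k-NN to find low-noise samples can be sketched as follows. This is a simplified neighbor-agreement filter in the spirit of cutstats, not the cut statistic itself or the authors' implementation: each sample is scored by the fraction of its k nearest neighbors that share its (possibly noisy) label, and the highest-scoring samples are kept. The function names, the `keep_fraction` parameter, and the toy data are assumptions made for this example.

```python
import math

def knn_agreement_scores(points, noisy_labels, k):
    """Score each sample by the share of its k nearest neighbors with the same label."""
    scores = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors = others[:k]
        agree = sum(noisy_labels[j] == noisy_labels[i] for j in neighbors)
        scores.append(agree / k)
    return scores

def select_subset(points, noisy_labels, k, keep_fraction):
    """Return sorted indices of the samples with the highest agreement scores."""
    scores = knn_agreement_scores(points, noisy_labels, k)
    n_keep = max(1, int(keep_fraction * len(points)))
    # sorted() is stable, so ties keep their original index order.
    ranked = sorted(range(len(points)), key=lambda i: -scores[i])
    return sorted(ranked[:n_keep])

# Toy data (assumed): two well-separated clusters with one flipped label each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
noisy_labels = [0, 0, 0, 1, 1, 1, 1, 0]  # indices 3 and 7 are mislabeled

selected = select_subset(points, noisy_labels, k=3, keep_fraction=0.75)
print(selected)  # [0, 1, 2, 4, 5, 6]
```

The filter is only as good as the neighborhoods it relies on, which is why the abstract ties the quality of the selected subset to the accuracy of k-NN. Passing points through an invariant representation before computing distances would change which samples count as neighbors without changing the selection logic.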
Problem

Research questions and friction points this paper is trying to address.

label noise
training data subset selection
high-dimensional data
data symmetries
k-nearest neighbors
Innovation

Methods, ideas, or system contributions that make the work stand out.

data symmetries
label noise
k-nearest neighbors
invariant representations
subset selection