A Decade's Battle on Dataset Bias: Are We There Yet?

📅 2024-03-13
🏛️ arXiv.org
📈 Citations: 19
Influential: 1
🤖 AI Summary
This work challenges the prevailing assumption that dataset bias has been mitigated in modern vision models by systematically testing whether those models can discriminate image provenance. Using three large-scale open datasets (YFCC, Conceptual Captions (CC), and DataComp), the authors run three-way dataset-classification experiments with state-of-the-art pre-trained vision models and complement them with feature-transferability and generalization analyses. The models reach 84.7% accuracy in identifying an image's source dataset on held-out validation data, demonstrating persistent and substantial dataset-level bias. Crucially, the discriminative features are semantically coherent and transfer across tasks, indicating that the models learn generalizable patterns rather than merely memorizing artifacts. These findings reveal latent systemic biases in current dataset curation and model evaluation practices, and offer new perspectives for bias modeling, dataset auditing, and robust generalization research.

📝 Abstract
We revisit the "dataset classification" experiment suggested by Torralba and Efros a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be simply explained by memorization. We hope our discovery will inspire the community to rethink the issue involving dataset bias and model capabilities.
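The dataset-classification setup in the abstract can be illustrated with a toy stand-in: a model is trained on examples labeled only by which dataset they came from, then evaluated on held-out data. The sketch below is purely hypothetical; the feature means and the nearest-centroid classifier are stand-ins for the raw images and deep networks the paper actually uses, chosen only to make the experimental protocol concrete.

```python
import random

random.seed(0)

DATASETS = ["YFCC", "CC", "DataComp"]
# Hypothetical per-dataset feature means: stand-ins for whatever low-level
# statistics a real classifier would exploit to tell the datasets apart.
MEANS = {"YFCC": 0.0, "CC": 1.0, "DataComp": 2.0}

def sample(name, n, dim=8):
    """Draw n synthetic feature vectors centered on the dataset's mean."""
    mu = MEANS[name]
    return [[random.gauss(mu, 0.5) for _ in range(dim)] for _ in range(n)]

# Build train/val splits for the three-way dataset-classification task:
# each example is (feature_vector, source_dataset_label).
train = [(x, name) for name in DATASETS for x in sample(name, 200)]
val = [(x, name) for name in DATASETS for x in sample(name, 50)]

# Nearest-centroid "classifier": average the training features per dataset.
centroids = {}
for name in DATASETS:
    feats = [x for x, y in train if y == name]
    dim = len(feats[0])
    centroids[name] = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]

def predict(x):
    """Assign x to the dataset whose centroid is closest (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(DATASETS, key=lambda name: dist(centroids[name]))

# Held-out accuracy: high accuracy means the "datasets" are separable,
# which is exactly the signature of dataset bias the paper measures.
acc = sum(predict(x) == y for x, y in val) / len(val)
print(f"held-out dataset-classification accuracy: {acc:.3f}")
```

In this synthetic setting the classes are trivially separable, so accuracy is near 1.0; the paper's surprising result is that real images from YFCC, CC, and DataComp remain this separable (84.7% three-way accuracy) even after a decade of effort to build diverse, less biased datasets.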
Problem

Research questions and friction points this paper is trying to address.

Reevaluating whether dataset bias persists in the era of large-scale, diverse datasets and modern neural networks.
Measuring how accurately modern networks can classify which dataset an image comes from.
Determining whether dataset classifiers learn generalizable features or merely memorize.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that modern neural networks classify image provenance with high accuracy (84.7% on a three-way task).
Demonstrates that the dataset classifier learns semantic features that generalize and transfer.
Prompts the community to rethink dataset bias and its interaction with model capability.