AI Summary
Existing object recognition datasets predominantly rely on web-crawled images, introducing geographic bias (over-representation of North America and Europe), privacy risks from personally identifiable information (PII), and stereotypical biases. Method: We propose GeoDE, a geographically diverse, crowdsourced image benchmark covering six world regions and 40 categories (61,940 images), with strict PII removal and balanced regional sampling. GeoDE introduces a geography-stratified crowdsourcing paradigm for data collection and annotation that explicitly avoids the inherent biases of web-sourced data. Contribution/Results: Experiments reveal significant performance degradation of mainstream models (e.g., ResNet-50) on non-Western regions. Fine-tuning with only 1,000–2,000 GeoDE images per region boosts cross-regional average accuracy by up to 12.3%, demonstrating that even small amounts of geographically diverse data yield substantial gains in cross-regional generalization. GeoDE is the first publicly available benchmark designed to systematically address geographic representativeness in visual recognition.
Abstract
Current dataset collection methods typically scrape large amounts of data from the web. While this technique is extremely scalable, data collected in this way tends to reinforce stereotypical biases, can contain personally identifiable information, and typically originates from Europe and North America. In this work, we rethink the dataset collection paradigm and introduce GeoDE, a geographically diverse dataset with 61,940 images spanning 40 classes and 6 world regions, containing no personally identifiable information, and collected through crowd-sourcing. We analyse GeoDE to understand how images collected in this manner differ from web-scraped ones. Despite the smaller size of this dataset, we demonstrate its use as both an evaluation and a training dataset, highlight shortcomings in current models, and show improved performance when even small amounts of GeoDE data (1,000–2,000 images per region) are added to a training set. We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/
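The augmentation experiment described above mixes a small, fixed quota of images from each world region into the training set. A minimal sketch of that region-balanced sampling step is below; the function name and data layout are hypothetical illustrations, not part of the GeoDE release, which only specifies the 1,000–2,000 images-per-region budget.

```python
import random

def region_balanced_sample(images_by_region, n_per_region, seed=0):
    """Draw up to n_per_region image ids from each region and mix them.

    images_by_region: dict mapping region name -> list of image ids
                      (hypothetical layout for illustration).
    Returns a shuffled list with an equal quota from every region,
    mirroring the small per-region augmentation the paper evaluates.
    """
    rng = random.Random(seed)
    sample = []
    for region in sorted(images_by_region):
        ids = images_by_region[region]
        k = min(n_per_region, len(ids))
        sample.extend(rng.sample(ids, k))
    rng.shuffle(sample)
    return sample

# Toy example with two of GeoDE's six regions and synthetic ids.
regions = {
    "Africa": [f"af_{i}" for i in range(5000)],
    "Europe": [f"eu_{i}" for i in range(5000)],
}
batch = region_balanced_sample(regions, 2000)
```

The balanced batch would then be appended to the web-scraped training data before fine-tuning; keeping the per-region quota equal is what prevents the augmentation from reintroducing the original geographic skew.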