AI Summary
Existing object recognition datasets predominantly rely on web-crawled images, introducing geographic bias (over-representation of North America and Europe), privacy risks from personally identifiable information (PII), and stereotypical biases. Method: We propose GeoDE, a geographically diverse, crowdsourced image benchmark covering six world regions and 40 categories (61,940 images), with strict PII removal and balanced regional sampling. GeoDE introduces a geography-stratified crowdsourcing paradigm for data collection and annotation that explicitly avoids the inherent biases of web-sourced data. Contribution/Results: Experiments reveal significant performance degradation of mainstream models (e.g., ResNet-50) on non-Western regions. Fine-tuning with only 1,000–2,000 GeoDE images per region boosts cross-regional average accuracy by up to 12.3%, demonstrating that even small amounts of geographically diverse data yield substantial gains in cross-regional generalization. GeoDE is the first publicly available benchmark designed to systematically address geographic representativeness in visual recognition.
Abstract
Current dataset collection methods typically scrape large amounts of data from the web. While this technique is extremely scalable, data collected in this way tends to reinforce stereotypical biases, can contain personally identifiable information, and typically originates from Europe and North America. In this work, we rethink the dataset collection paradigm and introduce GeoDE, a geographically diverse dataset with 61,940 images spanning 40 classes and 6 world regions, containing no personally identifiable information, and collected through crowd-sourcing. We analyse GeoDE to understand how images collected in this manner differ from web-scraped ones. Despite the smaller size of this dataset, we demonstrate its use as both an evaluation and a training dataset, highlight shortcomings in current models, and show improved performance when even small amounts of GeoDE data (1,000–2,000 images per region) are added to a training set. We release the full dataset and code at https://geodiverse-data-collection.cs.princeton.edu/
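The augmentation experiment described above mixes a small, fixed quota of images from each world region into the training set. A minimal sketch of that region-balanced sampling step is below; the function name and data layout are hypothetical illustrations, not part of the GeoDE release, which only specifies the 1,000–2,000 images-per-region budget.

```python
import random

def region_balanced_sample(images_by_region, n_per_region, seed=0):
    """Draw up to n_per_region image ids from each region and mix them.

    images_by_region: dict mapping region name -> list of image ids
                      (hypothetical layout for illustration).
    Returns a shuffled list with an equal quota from every region,
    mirroring the small per-region augmentation the paper evaluates.
    """
    rng = random.Random(seed)
    sample = []
    for region in sorted(images_by_region):
        ids = images_by_region[region]
        k = min(n_per_region, len(ids))
        sample.extend(rng.sample(ids, k))
    rng.shuffle(sample)
    return sample

# Toy example with two of GeoDE's six regions and synthetic ids.
regions = {
    "Africa": [f"af_{i}" for i in range(5000)],
    "Europe": [f"eu_{i}" for i in range(5000)],
}
batch = region_balanced_sample(regions, 2000)
```

The balanced batch would then be appended to the web-scraped training data before fine-tuning; keeping the per-region quota equal is what prevents the augmentation from reintroducing the original geographic skew.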