Exploring the Equivalence of Closed-Set Generative and Real Data Augmentation in Image Classification

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the efficacy of closed-set generative data augmentation for image classification—specifically, whether generative models (e.g., diffusion models) trained solely on the target training set can produce synthetic samples equivalent to real images for improving classifier performance. Method: We conduct large-scale experiments across natural and medical imaging benchmarks, systematically evaluating how closed-set versus open-set synthetic data affects classifier generalization. We introduce quantitative metrics to empirically assess the functional equivalence between synthetic and real data augmentations. Contribution/Results: We establish the first empirical quantification of this equivalence and propose actionable guidelines for synthetic data integration. Our analysis reveals a nonlinear relationship between synthetic data volume and augmentation gain. Crucially, high-fidelity closed-set generation achieves classification accuracy comparable to real-data augmentation when synthetic samples constitute 30–50% of the augmented training set—providing a theoretically grounded, low-cost, and privacy-preserving alternative to conventional augmentation strategies.
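As a rough illustration of the 30–50% guideline reported above, the sketch below builds an augmented training set in which synthetic samples make up a target fraction of the final mix. This is a hypothetical helper written for this summary, not code from the paper; the function name and interface are assumptions.

```python
import random

def build_augmented_set(real_images, synthetic_images,
                        synthetic_fraction=0.4, seed=0):
    """Combine real and synthetic samples so that synthetic images make up
    `synthetic_fraction` of the augmented training set.

    Hypothetical illustration of the paper's reported guideline that
    closed-set synthetic data helps most at roughly 30-50% of the mix.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_real = len(real_images)
    # Solve n_syn / (n_real + n_syn) = synthetic_fraction for n_syn.
    n_syn = round(n_real * synthetic_fraction / (1.0 - synthetic_fraction))
    n_syn = min(n_syn, len(synthetic_images))  # cap at available samples
    return real_images + rng.sample(synthetic_images, n_syn)
```

For example, with 60 real images and `synthetic_fraction=0.4`, the helper draws 40 synthetic samples, yielding a 100-image set that is 40% synthetic.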

📝 Abstract
In this paper, we address a key scientific problem in machine learning: Given a training set for an image classification task, can we train a generative model on this dataset to enhance classification performance? (i.e., closed-set generative data augmentation). We start by exploring the distinctions and similarities between real images and closed-set synthetic images generated by advanced generative models. Through extensive experiments, we offer systematic insights into the effective use of closed-set synthetic data for augmentation. Notably, we empirically determine the equivalent scale of synthetic images needed for augmentation. In addition, we show quantitative equivalence between real data augmentation and open-set generative augmentation (generative models trained on data beyond the given training set). While this aligns with the common intuition that real images are generally preferred, our empirical formulation also offers a guideline for quantifying the increased scale of synthetic data augmentation required to achieve comparable image classification performance. Our results on natural and medical image datasets further illustrate how this effect varies with the baseline training set size and the amount of synthetic data incorporated.
Problem

Research questions and friction points this paper is trying to address.

Comparing real and synthetic image augmentation for classification
Determining equivalent synthetic data scale for effective augmentation
Quantifying synthetic data needed to match real augmentation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-set generative data augmentation for classification
Quantify synthetic data scale for equivalent performance
Compare real and open-set generative augmentation
Haowen Wang
University of California, San Diego

Guowei Zhang
Carnegie Mellon University
Embodied AI · Computer Vision · Reinforcement Learning

Xiang Zhang
University of California, San Diego

Zeyuan Chen
University of California, San Diego

Haiyang Xu
University of California, San Diego

Dou Hoon Kwark
University of Illinois at Urbana-Champaign

Zhuowen Tu
Professor, Cognitive Science, Computer Science & Engineering, UC San Diego
Computer Vision · Machine Learning · Deep Learning · Neural Computation