🤖 AI Summary
To address poor downstream model generalization in synthetic data training caused by insufficient prompt diversity, this work proposes leveraging automatically generated captions—produced by vision-language models (e.g., BLIP-2) on real images—as natural, semantically rich prompts, replacing hand-crafted class names. Specifically, high-fidelity image captions are concatenated with class names to form composite prompts that guide Stable Diffusion for generating more discriminative synthetic images. Theoretically, we establish, for the first time, a formal connection between prompt distribution diversity and synthetic data training efficacy. Practically, our approach mitigates class name ambiguity and improves semantic consistency. Evaluated on ImageNette, ImageNet-100, and ImageNet-1K, our method yields an average +10% improvement in downstream classification accuracy, significantly enhancing the utility of synthetic data for model training.
📝 Abstract
With the rapid development of Artificial Intelligence Generated Content (AIGC), it has become common practice in many learning tasks to train or fine-tune large models on synthetic data due to the data-scarcity and privacy leakage problems. Albeit promising with unlimited data generation, owing to massive and diverse information conveyed in real images, it is challenging for text-to-image generative models to synthesize informative training data with hand-crafted prompts, which usually leads to inferior generalization performance when training downstream models. In this paper, we theoretically analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. Then we correspondingly propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Specifically, we caption each real image with the advanced captioning model to obtain informative and faithful prompts that extract class-relevant information and clarify the polysemy of class names. The image captions and class names are concatenated to prompt generative models for training image synthesis. Extensive experiments on ImageNette, ImageNet-100, and ImageNet-1K verify that our method significantly improves the performance of models trained on synthetic training data, i.e., 10% classification accuracy improvements on average.