Image Captions are Natural Prompts for Text-to-Image Models

📅 2023-07-17

🏛️ arXiv.org

📈 Citations: 22

✨ Influential: 2

career value

167K/year

🤖 AI Summary

To address poor downstream model generalization in synthetic data training caused by insufficient prompt diversity, this work proposes leveraging automatically generated captions—produced by vision-language models (e.g., BLIP-2) on real images—as natural, semantically rich prompts, replacing hand-crafted class names. Specifically, high-fidelity image captions are concatenated with class names to form composite prompts that guide Stable Diffusion for generating more discriminative synthetic images. Theoretically, we establish, for the first time, a formal connection between prompt distribution diversity and synthetic data training efficacy. Practically, our approach mitigates class name ambiguity and improves semantic consistency. Evaluated on ImageNette, ImageNet-100, and ImageNet-1K, our method yields an average +10% improvement in downstream classification accuracy, significantly enhancing the utility of synthetic data for model training.

📝 Abstract

With the rapid development of Artificial Intelligence Generated Content (AIGC), it has become common practice in many learning tasks to train or fine-tune large models on synthetic data due to the data-scarcity and privacy leakage problems. Albeit promising with unlimited data generation, owing to massive and diverse information conveyed in real images, it is challenging for text-to-image generative models to synthesize informative training data with hand-crafted prompts, which usually leads to inferior generalization performance when training downstream models. In this paper, we theoretically analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. Then we correspondingly propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Specifically, we caption each real image with the advanced captioning model to obtain informative and faithful prompts that extract class-relevant information and clarify the polysemy of class names. The image captions and class names are concatenated to prompt generative models for training image synthesis. Extensive experiments on ImageNette, ImageNet-100, and ImageNet-1K verify that our method significantly improves the performance of models trained on synthetic training data, i.e., 10% classification accuracy improvements on average.

Problem

Research questions and friction points this paper is trying to address.

Overcoming data-scarcity and privacy issues with synthetic training data

Generating informative images for prediction tasks using proper prompts

Enhancing model generalization and out-of-distribution robustness with synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using image captions as prompts

Concatenating captions with class names

Enhancing synthetic data informativeness

🔎 Similar Papers

No similar papers found.