Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pre-trained model-assisted text-to-image GANs suffer from severely degraded generation diversity and prohibitively high training costs. Method: We propose an efficient, high-fidelity generation framework featuring (i) a dual-specialized discriminator that disentangles semantic alignment from image realism discrimination; (ii) Slicing Adversarial Networks (SANs) tailored for fine-grained text–image matching to strengthen adversarial learning; and (iii) CLIP-guided optimization with zero-shot FID regularization. Contribution/Results: We introduce the Per-Prompt Diversity (PPD) metric—the first to quantitatively evaluate intra-prompt generation diversity. Our approach reduces training cost by two orders of magnitude while achieving zero-shot FID on par with state-of-the-art large-scale GANs. It significantly improves both generation diversity and image fidelity, setting a new benchmark for efficiency and quality in pre-trained model-augmented text-to-image synthesis.

📝 Abstract
Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale text-to-image datasets. However, training such models entails a high cost, limiting many applications and research uses. One promising direction for reducing this cost is the incorporation of pre-trained models. An existing method that uses a pre-trained model for the generator significantly reduces training cost compared with other large-scale GANs, but we found that it loses generation diversity for a given prompt by a large margin. To build an efficient, high-fidelity text-to-image GAN without this compromise, we propose using two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt along with better sample fidelity. We also propose a metric, Per-Prompt Diversity (PPD), to quantitatively evaluate the diversity of text-to-image models. SCAD achieves a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude lower training cost.
Problem

Research questions and friction points this paper is trying to address.

Reducing high training cost of large-scale text-to-image GANs
Improving generation diversity for given text prompts
Maintaining sample fidelity while enhancing model efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates pre-trained models for cost efficiency
Uses two specialized discriminators with SANs
Introduces Per-Prompt Diversity (PPD) metric
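The paper does not publish the exact formula for PPD in this summary, but a natural reading of "intra-prompt generation diversity" is the average pairwise feature distance among samples generated from a single prompt. The sketch below is an illustrative assumption, not the authors' definition: it takes pre-computed image embeddings (e.g. CLIP image features, which the framework already uses) and returns the mean pairwise cosine distance. The function name `per_prompt_diversity` is hypothetical.

```python
import numpy as np

def per_prompt_diversity(embeddings: np.ndarray) -> float:
    """Illustrative PPD sketch (NOT the paper's exact metric):
    mean pairwise cosine distance among feature embeddings of
    images generated from ONE prompt. Assumes one row per image,
    e.g. CLIP image embeddings of shape (n_samples, dim)."""
    # Normalize rows to unit length so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T          # (n, n) pairwise cosine similarities
    n = len(embeddings)
    # Average similarity over off-diagonal pairs only (exclude self-pairs).
    mean_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))
    return float(1.0 - mean_sim)      # distance = 1 - similarity

# Identical samples -> zero diversity; mutually orthogonal samples -> 1.
assert abs(per_prompt_diversity(np.ones((4, 8)))) < 1e-6
assert abs(per_prompt_diversity(np.eye(4, 8)) - 1.0) < 1e-6
```

Under this reading, a mode-collapsed model that maps one prompt to near-identical images scores near 0, while a model producing semantically varied renderings scores higher; averaging the score over a prompt set would give a model-level diversity number.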