🤖 AI Summary
Existing zero-shot image classification methods (e.g., CLIP) rely on human-annotated image–text pairs for cross-modal alignment, incurring high data curation costs, and typically adopt dual-tower architectures that hinder lightweight deployment. To address these limitations, we propose LGCLIP, a novel framework that leverages large language models (LLMs) to generate class-specific textual prompts, which guide diffusion models to synthesize class prototype images; classification is then performed using only a single-tower vision encoder, eliminating dependence on real-world image–text pairs entirely. LGCLIP establishes the first zero-shot paradigm in which LLMs and diffusion models collaboratively generate visual prototypes, enabling classification from class labels alone with a single-tower encoder. Evaluated on multiple benchmarks, LGCLIP matches CLIP's performance while substantially reducing data dependency and architectural complexity, empirically validating synthetic prototype images as effective and efficient semantic anchors.
📝 Abstract
Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated image–text pairs to align the visual and textual modalities. This dependency imposes substantial cost and stringent accuracy requirements on dataset curation. Moreover, processing data from two modalities forces most models to adopt dual-tower encoders, which hinders lightweight deployment. To address these limitations, we introduce "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation" (LGCLIP). LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes: the visual features of real images are extracted and compared against the features of these prototypes to produce predictions. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input throughout the entire pipeline, eliminating the need for manually annotated image–text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance on zero-shot classification tasks and establishing a novel paradigm for classification.
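The prototype-comparison step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the LLM prompt generation, the diffusion synthesis of prototype images, and the vision encoder itself are assumed to have already produced feature vectors, and all function names here are illustrative. The core idea shown is nearest-prototype classification via cosine similarity between real-image features and (averaged) synthetic-prototype features.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Normalize feature vectors to unit length so the dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def build_class_prototypes(proto_feats_per_class: list[np.ndarray]) -> np.ndarray:
    """Average the features of each class's synthetic prototype images into one anchor per class.

    proto_feats_per_class: list of arrays, each of shape (n_images_for_class, dim),
    e.g. features of several diffusion-generated images for the same label.
    Returns an array of shape (n_classes, dim).
    """
    return np.stack([l2_normalize(feats).mean(axis=0) for feats in proto_feats_per_class])

def classify_by_prototypes(image_feats: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Predict the class whose prototype feature is most cosine-similar to each image feature.

    image_feats: (n_images, dim) features of real test images from the single vision encoder.
    prototypes:  (n_classes, dim) class anchors from build_class_prototypes.
    Returns predicted class indices of shape (n_images,).
    """
    sims = l2_normalize(image_feats) @ l2_normalize(prototypes).T
    return sims.argmax(axis=-1)

# Toy usage with made-up 3-D features (real features would come from the encoder):
protos = build_class_prototypes([
    np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]),  # two synthetic prototypes for class 0
    np.array([[0.0, 1.0, 0.1]]),                    # one synthetic prototype for class 1
])
preds = classify_by_prototypes(np.array([[0.95, 0.05, 0.0], [0.1, 0.9, 0.0]]), protos)
```

Averaging several generated prototypes per class is one plausible way to smooth over diffusion-sampling variance; the abstract does not specify how many prototypes per class are used.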