🤖 AI Summary
This work investigates prompt inversion for image generation models, a discrete optimization problem. We systematically evaluate five approaches (GCG, PEZ, Random Search, AutoDAN, and the BLIP2 image captioner), delivering the first comprehensive cross-method benchmark of discrete optimizers for this task. Key findings are: (1) CLIP similarity correlates weakly with perceptual image fidelity, challenging the prevailing practice of optimizing solely for CLIP score; (2) prompts generated by strong image captioners (e.g., BLIP2) yield reconstructions of significantly higher fidelity than those produced by discrete optimizers; and (3) although all optimizers effectively reduce their respective objective values, the corresponding improvements in actual image fidelity remain marginal. Collectively, our study establishes a new evaluation benchmark for prompt inversion and prompts critical methodological reflection on objective design, optimizer efficacy, and the role of descriptive priors in discrete prompt recovery.
📝 Abstract
Recovering the natural language prompts of image generation models solely from the generated images is a difficult discrete optimization problem. In this work, we present the first head-to-head comparison of recent discrete optimization techniques for the problem of prompt inversion. We evaluate Greedy Coordinate Gradient (GCG), PEZ, Random Search, AutoDAN, and BLIP2's image captioner across evaluation metrics covering both the quality of the inverted prompts and the quality of the images they generate. We find that the CLIP similarity between an inverted prompt and the ground truth image is a poor proxy for the similarity between the ground truth image and the image generated from that prompt. While the discrete optimizers effectively minimize their objectives, simply using responses from a well-trained captioner often leads to generated images that more closely resemble those produced by the original prompts.
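At its core, the CLIP score that these optimizers target is a cosine similarity between embedding vectors. The toy sketch below (NumPy only; the fixed 4-d vectors are fabricated stand-ins for real CLIP encoder outputs, chosen purely for illustration) shows how a prompt can score highly against the ground truth image while the image regenerated from that prompt still drifts away in embedding space:

```python
import numpy as np

def clip_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors
    (the CLIP score, up to a learned temperature scale)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for real CLIP encoder outputs.
ground_truth_image = np.array([1.0, 0.0, 1.0, 0.0])
inverted_prompt    = np.array([0.9, 0.1, 0.9, 0.1])  # high text-to-image score...
regenerated_image  = np.array([0.2, 1.0, 0.1, 1.0])  # ...yet the regenerated image can diverge

prompt_score = clip_similarity(inverted_prompt, ground_truth_image)
image_score  = clip_similarity(regenerated_image, ground_truth_image)
print(f"prompt-to-image CLIP score: {prompt_score:.3f}")
print(f"image-to-image similarity:  {image_score:.3f}")
```

In a real pipeline the vectors would come from CLIP's text and image encoders; the point of the sketch is only that the optimized quantity (prompt-to-image score) and the quantity one actually cares about (image-to-image fidelity) are different numbers and need not move together.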