🤖 AI Summary
This work addresses a limitation of existing prompt inversion methods for text-to-image generation: they often yield unnatural, semantically obscure prompts, which leads to low-fidelity image reconstructions and poor controllability. The authors propose PromptEvolver, a framework that applies a genetic algorithm directly in the space of natural-language prompts and needs only black-box access to the image generator, that is, its output images. A vision-language model guides the evolutionary search, so the resulting prompts stay semantically coherent and human-readable. Evaluated across multiple prompt inversion benchmarks, the approach consistently outperforms competing methods in image reconstruction fidelity while producing more interpretable prompts.
📝 Abstract
Text-to-image generation has progressed rapidly, but faithfully generating complex scenes still requires extensive trial and error to find the right prompt. In the prompt inversion task, the goal is to recover a textual prompt that can faithfully reconstruct a given target image. Existing methods frequently yield suboptimal reconstructions and produce unnatural, hard-to-interpret prompts that hinder transparency and controllability. In this work, we present PromptEvolver, a prompt inversion approach that generates natural-language prompts while achieving high-fidelity reconstructions of the target image. Our method uses a genetic algorithm to optimize the prompt, leveraging a strong vision-language model to guide the evolution process. Importantly, it works with black-box generation models because it requires only their image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods.
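To make the idea of a genetic algorithm over natural-language prompts concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the operators, the fitness measure, or how the vision-language model is used, so everything here is an assumption: `evolve_prompt`, `toy_fitness`, `toy_mutate`, and `toy_crossover` are hypothetical names. In a real system, the fitness function would presumably call the black-box generator on each prompt and compare the output image with the target, and a vision-language model might propose the mutations and crossovers. The toy stand-ins below use word overlap with a hidden description only so that the loop runs without any model.

```python
import random
from typing import Callable, List


def evolve_prompt(
    seed_prompts: List[str],
    fitness: Callable[[str], float],
    mutate: Callable[[str, random.Random], str],
    crossover: Callable[[str, str, random.Random], str],
    generations: int = 30,
    population_size: int = 8,
    elite: int = 2,
    seed: int = 0,
) -> str:
    """Generic elitist genetic algorithm over prompt strings (illustrative only)."""
    if not seed_prompts or elite < 1:
        raise ValueError("need at least one seed prompt and elite >= 1")
    rng = random.Random(seed)
    population = list(seed_prompts)
    for _ in range(generations):
        # Rank by fitness; only this step would need the generator's output images.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:elite]  # elitism: the best prompts carry over unchanged
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = rng.choice(survivors), rng.choice(survivors)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return max(population, key=fitness)


# --- Toy stand-ins (assumptions, used only to make the sketch runnable) ---
TARGET_WORDS = {"red", "fox", "jumping", "over", "snowy", "log"}
VOCAB = sorted(TARGET_WORDS | {"blue", "cat", "sitting", "under", "sunny", "tree"})


def toy_fitness(prompt: str) -> float:
    """Jaccard overlap with a hidden description; stands in for image similarity."""
    words = set(prompt.split())
    if not words:
        return 0.0
    return len(words & TARGET_WORDS) / len(words | TARGET_WORDS)


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Replace or append one random word; stands in for a model-proposed edit."""
    words = prompt.split()
    if words and rng.random() < 0.5:
        words[rng.randrange(len(words))] = rng.choice(VOCAB)
    else:
        words.append(rng.choice(VOCAB))
    return " ".join(words)


def toy_crossover(a: str, b: str, rng: random.Random) -> str:
    """Join a random prefix of one parent with a random suffix of the other."""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: rng.randint(0, len(wa))] + wb[rng.randint(0, len(wb)):])


seeds = ["blue cat sitting", "red tree"]
best = evolve_prompt(seeds, toy_fitness, toy_mutate, toy_crossover)
print(best, toy_fitness(best))
```

Because the top-ranked prompts survive each generation unchanged, the best fitness in the population never decreases when the fitness function is deterministic. With a stochastic image generator that guarantee would no longer hold, and scores might need to be averaged over several samples.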