Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

📅 2024-03-28
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Manual prompt engineering for personalized text-to-image (T2I) generation is slow, and existing automated methods tend to require white-box model access, produce prompts that are hard to interpret, or transfer poorly across models. This paper proposes PRISM, a black-box prompt optimization framework built on large language model (LLM) in-context learning. PRISM needs only query access to the T2I model, with no gradients or internal parameters, and iteratively refines candidate prompts guided by reference images to produce human-readable prompts that transfer across models. Experiments on Stable Diffusion, DALL·E, and Midjourney show that PRISM generates accurate prompts for objects, artistic styles, and whole images, and the resulting prompts are reported to be both interpretable and transferable across models.

📝 Abstract
Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution built upon the reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
Problem

Research questions and friction points this paper is trying to address.

Automating labor-intensive prompt engineering for T2I models
Enhancing transferability of prompts across black-box T2I systems
Generating human-interpretable prompts without white-box model access
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated black-box prompt engineering algorithm
Leverages LLM in-context learning for refinement
Generates transferable prompts for multiple T2I models
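
The iterative refinement described in the abstract can be sketched as a loop that only ever calls the T2I model as a black box. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `refine_prompt`, `propose`, `generate`, and `judge` are hypothetical, the `streams` and `iterations` parameters are illustrative, and the toy stand-ins below replace the LLM, the T2I API, and the image scorer so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# Toy stand-in (assumption): an "image" is just the set of concepts it shows.
Image = frozenset
# Past (prompt, score) pairs, i.e. what an LLM would see in context.
History = List[Tuple[str, float]]


def refine_prompt(
    reference: Image,
    propose: Callable[[Image, History], str],  # hypothetical LLM prompt proposer
    generate: Callable[[str], Image],          # hypothetical black-box T2I call
    judge: Callable[[Image, Image], float],    # hypothetical similarity scorer
    streams: int = 2,
    iterations: int = 5,
) -> Tuple[str, float]:
    """Return the best-scoring prompt found using only black-box calls."""
    best_prompt, best_score = "", float("-inf")
    for _ in range(streams):
        history: History = []  # each stream starts with a fresh context
        for _ in range(iterations):
            prompt = propose(reference, history)
            image = generate(prompt)          # no gradients or weights needed
            score = judge(image, reference)
            history.append((prompt, score))   # feedback for the next proposal
            if score > best_score:
                best_prompt, best_score = prompt, score
    return best_prompt, best_score


# --- Toy stand-ins so the sketch runs without any external service ---
VOCAB = ["red", "city", "fox", "watercolor", "snow"]


def toy_generate(prompt: str) -> Image:
    # Pretend the model draws exactly the words in the prompt.
    return frozenset(prompt.split())


def toy_judge(image: Image, reference: Image) -> float:
    # Jaccard similarity between the two concept sets.
    union = image | reference
    return len(image & reference) / len(union) if union else 1.0


def toy_propose(reference: Image, history: History) -> str:
    # A real multimodal LLM would inspect the reference image and the history;
    # this toy extends the best prompt so far with the next vocabulary word.
    if not history:
        return VOCAB[0]
    best_so_far, _ = max(history, key=lambda item: item[1])
    return best_so_far + " " + VOCAB[len(history) % len(VOCAB)]


reference_image = frozenset({"red", "fox", "snow"})
result = refine_prompt(reference_image, toy_propose, toy_generate, toy_judge)
print(result)
```

Because the loop depends only on the three callables, swapping `generate` for a different T2I backend leaves the rest unchanged, which is the property that black-box access makes possible.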