🤖 AI Summary
Text-to-image (T2I) models are highly sensitive to prompt phrasing: minor lexical changes often yield semantically inaccurate or visually inconsistent outputs. To address this, we propose a training-free, test-time prompt optimization framework that operates in a closed loop. After each generation, a pre-trained multimodal large language model (MLLM) interprets the generated image, assesses how it deviates from the original prompt, and refines the prompt for the next round, so the system corrects its own errors during inference. This is the first work to embed a human artist-like iterative refinement mechanism directly into T2I inference. It requires no model fine-tuning, additional training, or white-box access, which keeps it compatible with black-box T2I models. Evaluation across multiple benchmarks shows improved semantic fidelity and visual coherence, mitigating prompt sensitivity.
📝 Abstract
Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce TIR, a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model. In our approach, each generation step is followed by a refinement step, in which a pretrained multimodal large language model (MLLM) analyzes the output image together with the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined, physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors in a way that mirrors the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, while maintaining plug-and-play integration with black-box T2I models.
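The generate-then-refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine_loop`, `Critique`, `generate`, `critique`, and the `max_rounds` budget are all hypothetical, and the abstract does not specify a stopping rule. The T2I model and the MLLM are passed in as opaque callables to reflect the black-box setting, and the toy stand-ins at the bottom exist only so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical container for the MLLM's verdict on one image."""
    aligned: bool        # True when the image matches the user's prompt
    refined_prompt: str  # prompt to try next (ignored when aligned)


def refine_loop(
    user_prompt: str,
    generate: Callable[[str], Any],             # black-box T2I model (assumed interface)
    critique: Callable[[Any, str], Critique],   # MLLM check against the user's prompt
    max_rounds: int = 3,                        # assumed budget; not given in the abstract
) -> Tuple[Any, List[str]]:
    """Alternate generation and prompt refinement until aligned or out of rounds."""
    prompt = user_prompt
    tried: List[str] = []
    image = None
    for _ in range(max_rounds):
        image = generate(prompt)
        tried.append(prompt)
        # The critique always compares against the ORIGINAL user prompt,
        # so refinements cannot drift away from the user's intent.
        verdict = critique(image, user_prompt)
        if verdict.aligned:
            break
        prompt = verdict.refined_prompt
    return image, tried


# Toy stand-ins: the "image" is a string, and the fake generator only
# renders the colour when the prompt spells it out explicitly.
def toy_generate(prompt: str) -> str:
    return "red cube" if "colored red" in prompt else "gray cube"


def toy_critique(image: str, user_prompt: str) -> Critique:
    if "red" in image:
        return Critique(aligned=True, refined_prompt=user_prompt)
    return Critique(aligned=False,
                    refined_prompt=user_prompt + ", the cube is colored red")


final_image, prompts_tried = refine_loop("a red cube", toy_generate, toy_critique)
```

In a real setting, `generate` would wrap a call to whichever T2I system is in use and `critique` would wrap an MLLM query that returns both the alignment judgment and the rewritten prompt. Keeping both behind plain callables is one way to preserve the plug-and-play property the abstract emphasizes.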