Test-time Prompt Refinement for Text-to-Image Models

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models are highly sensitive to prompt phrasing: minor lexical changes often yield semantically inaccurate or visually inconsistent outputs. To address this, we propose a training-free, test-time prompt optimization framework that operates in a closed loop. Leveraging a pre-trained multimodal large language model (MLLM), it assesses alignment discrepancies between the generated image and the original prompt during inference, then iteratively refines the prompt, interprets the resulting image, and verifies alignment, enabling end-to-end self-correction. To our knowledge, this is the first work to embed a human-artist-like iterative refinement mechanism directly into T2I inference; it requires no model fine-tuning, additional training, or white-box access, making it fully compatible with arbitrary black-box T2I models. Extensive evaluation across multiple benchmarks demonstrates significant improvements in semantic fidelity and visual coherence, effectively mitigating prompt sensitivity.

📝 Abstract
Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce TIR, a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model. In our approach, each generation step is followed by a refinement step, in which a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined, physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
Problem

Research questions and friction points this paper is trying to address.

Addresses prompt sensitivity in text-to-image models
Improves alignment between prompts and generated images
Enables iterative refinement without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop prompt refinement framework
Pretrained MLLM analyzes image-prompt alignment
Iterative refinement improves visual coherence
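The closed-loop procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_image`, `critique`, and `refine` are hypothetical stubs standing in for calls to a black-box T2I model and an MLLM, and the "image" is modeled as the set of concepts the generator rendered.

```python
# Hypothetical sketch of the generate -> critique -> refine loop.
# All three helpers are stubs; a real system would call a black-box
# T2I API and a multimodal LLM instead.

def generate_image(prompt):
    # Stub T2I model: the "image" is the set of rendered concepts.
    # It drops one concept unless the prompt emphasizes it, mimicking
    # a typical omission error.
    concepts = set(prompt.split(", "))
    if "emphasized" not in prompt:
        concepts.discard("red scarf")  # simulated failure mode
    return concepts

def critique(image, prompt):
    # Stub MLLM check: report prompt concepts missing from the image
    # (e.g., missing objects, incorrect attributes).
    wanted = set(prompt.split(", "))
    return wanted - image

def refine(prompt, misalignments):
    # Stub refinement: re-emphasize the missing concepts.
    emphasis = ", ".join(f"({m}:emphasized)" for m in sorted(misalignments))
    return prompt + ", " + emphasis

def tir_loop(prompt, max_rounds=3):
    """Generate, verify prompt-image alignment, refine, and repeat."""
    for _ in range(max_rounds):
        image = generate_image(prompt)
        gaps = critique(image, prompt)
        if not gaps:  # aligned: stop early
            break
        prompt = refine(prompt, gaps)
    return prompt, image

final_prompt, final_image = tir_loop("a cat, red scarf")
```

Because the loop is driven purely by the MLLM's verdict on the generated image, no gradients or model internals are needed, which is what makes the approach compatible with black-box generators.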
Mohammad Abdul Hafeez Khan
Florida Institute of Technology, Melbourne, USA
Yash Jain
Essential
Foundation Models, Computer Vision, Multi-modal learning
Siddhartha Bhattacharyya
Florida Institute of Technology, Melbourne, USA
Vibhav Vineet
Microsoft Research
computer vision, machine learning, Artificial Intelligence, Robotics