Test-time Prompt Refinement for Text-to-Image Models

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image (T2I) models are highly sensitive to prompt phrasing: minor lexical changes often yield semantically inaccurate or visually inconsistent outputs. To address this, we propose a training-free, test-time prompt optimization framework that operates in a closed loop. Leveraging a pre-trained multimodal large language model (MLLM), it assesses alignment discrepancies between the generated image and the original prompt during inference, then iteratively refines the prompt, interprets the resulting image, and verifies alignment, enabling end-to-end self-correction. To our knowledge, this is the first work to embed a human-artist-like iterative refinement mechanism directly into T2I inference; it requires no model fine-tuning, additional training, or white-box access, making it fully compatible with arbitrary black-box T2I models. Extensive evaluation across multiple benchmarks demonstrates significant improvements in semantic fidelity and visual coherence, effectively mitigating prompt sensitivity.

📝 Abstract
Text-to-image (T2I) generation models have made significant strides but still struggle with prompt sensitivity: even minor changes in prompt wording can yield inconsistent or inaccurate outputs. To address this challenge, we introduce TIR, a closed-loop, test-time prompt refinement framework that requires no additional training of the underlying T2I model. In our approach, each generation step is followed by a refinement step, in which a pretrained multimodal large language model (MLLM) analyzes the output image and the user's prompt. The MLLM detects misalignments (e.g., missing objects, incorrect attributes) and produces a refined, physically grounded prompt for the next round of image generation. By iteratively refining the prompt and verifying alignment between the prompt and the image, TIR corrects errors, mirroring the iterative refinement process of human artists. We demonstrate that this closed-loop strategy improves alignment and visual coherence across multiple benchmark datasets, all while maintaining plug-and-play integration with black-box T2I models.
Problem

Research questions and friction points this paper is trying to address.

Addresses prompt sensitivity in text-to-image models
Improves alignment between prompts and generated images
Enables iterative refinement without model retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-loop prompt refinement framework
Pretrained MLLM analyzes image-prompt alignment
Iterative refinement improves visual coherence
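The closed-loop procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_image`, `critique`, and `refine` are hypothetical stubs standing in for calls to a black-box T2I model and an MLLM, and the "image" is modeled as the set of concepts the generator rendered.

```python
# Hypothetical sketch of the generate -> critique -> refine loop.
# All three helpers are stubs; a real system would call a black-box
# T2I API and a multimodal LLM instead.

def generate_image(prompt):
    # Stub T2I model: the "image" is the set of rendered concepts.
    # It drops one concept unless the prompt emphasizes it, mimicking
    # a typical omission error.
    concepts = set(prompt.split(", "))
    if "emphasized" not in prompt:
        concepts.discard("red scarf")  # simulated failure mode
    return concepts

def critique(image, prompt):
    # Stub MLLM check: report prompt concepts missing from the image
    # (e.g., missing objects, incorrect attributes).
    wanted = set(prompt.split(", "))
    return wanted - image

def refine(prompt, misalignments):
    # Stub refinement: re-emphasize the missing concepts.
    emphasis = ", ".join(f"({m}:emphasized)" for m in sorted(misalignments))
    return prompt + ", " + emphasis

def tir_loop(prompt, max_rounds=3):
    """Generate, verify prompt-image alignment, refine, and repeat."""
    for _ in range(max_rounds):
        image = generate_image(prompt)
        gaps = critique(image, prompt)
        if not gaps:  # aligned: stop early
            break
        prompt = refine(prompt, gaps)
    return prompt, image

final_prompt, final_image = tir_loop("a cat, red scarf")
```

Because the loop is driven purely by the MLLM's verdict on the generated image, no gradients or model internals are needed, which is what makes the approach compatible with black-box generators.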
Mohammad Abdul Hafeez Khan
Florida Institute of Technology, Melbourne, USA
Yash Jain
Essential
Foundation Models, Computer Vision, Multi-modal learning
Siddhartha Bhattacharyya
Florida Institute of Technology, Melbourne, USA
Vibhav Vineet
Microsoft Research
computer vision, machine learning, Artificial Intelligence, Robotics