What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

📅 2024-08-23
🏛️ arXiv.org
📈 Citations: 7
✨ Influential: 0
🤖 AI Summary
In text-to-image synthesis (TIS), novice users struggle to generate desired outputs due to limited prompt engineering skills, while existing automatic prompt generation methods lack interpretability and interactivity. To address this, we propose DialPrompt, the first multi-turn conversational prompt generation framework designed specifically for novices. It guides users iteratively to clarify 15 key visual attribute dimensions (e.g., style, composition, lighting), enabling interpretable mapping between prompt elements and visual attributes, as well as real-time user intervention. DialPrompt is trained on a novel, self-constructed multi-turn prompt optimization dataset and integrates preference-aware conditional generation with feedback-driven iterative refinement. Experiments demonstrate that DialPrompt improves image quality by 5.7% over state-of-the-art prompt engineering baselines, increases user-centeredness scores by 46.5%, and achieves an expert overall rating of 7.9/10.

๐Ÿ“ Abstract
The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models heavily rely on the quality and specificity of textual prompts, posing a challenge for novice users who may not be familiar with TIS-model-preferred prompt writing. Existing solutions relieve this via automatic model-preferred prompt generation from user queries. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. To address these issues, we propose DialPrompt, a multi-turn dialogue-based TIS prompt generation model that emphasises user-centricity. DialPrompt is designed to follow a multi-turn guidance workflow, where in each round of dialogue the model queries the user about their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt can improve interpretability by allowing users to understand the correlation between specific phrases and image attributes. Additionally, it enables greater user control and engagement in the prompt generation process, leading to more personalized and visually satisfying outputs. Experiments indicate that DialPrompt achieves a competitive result in the quality of synthesized images, outperforming existing prompt engineering approaches by 5.7%. Furthermore, in our user evaluation, DialPrompt outperforms existing approaches by 46.5% in user-centricity score and is rated 7.9/10 by 19 human reviewers.
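The multi-turn guidance workflow described above can be sketched as a simple dialogue loop: one turn per optimization dimension, with the user free to skip any of them, followed by assembly of the final TIS prompt. This is a minimal illustration, not the paper's implementation; the function and dimension names are hypothetical, and only three of the 15 mined dimensions are shown.

```python
# Hypothetical sketch of DialPrompt's multi-turn guidance loop.
# Only a subset of the paper's 15 mined dimensions is listed here.
DIMENSIONS = ["style", "composition", "lighting"]

def build_prompt(subject, ask):
    """Assemble a TIS prompt over multiple dialogue turns.

    `ask` stands in for one round of dialogue: it maps a dimension name
    to the user's stated preference, or None if the user skips it.
    """
    parts = [subject]
    for dim in DIMENSIONS:
        answer = ask(dim)      # one turn: query the user's preference
        if answer:             # user may decline to specify a dimension
            parts.append(f"{dim}: {answer}")
    return ", ".join(parts)

# Scripted "user" responses for demonstration (composition is skipped).
answers = {"style": "watercolor", "lighting": "golden hour"}
print(build_prompt("a quiet harbor at dawn", answers.get))
```

In an interactive setting, `ask` would be backed by the dialogue model's generated question and the user's typed reply; the point is that each prompt phrase traces back to one explicit user decision, which is what gives the approach its interpretability.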
Problem

Research questions and friction points this paper is trying to address.

Generating user-friendly prompts for text-to-image synthesis
Improving novice users' control over image generation process
Enhancing interactivity through multi-turn dialogue guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn dialogue guides prompt generation
Mined 15 dimensions from advanced users
User-centric control over prompt creation process