Why Settle for One? Text-to-ImageSet Generation and Evaluation

📅 2025-06-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work introduces Text-to-ImageSet (T2IS) generation—a novel task that synthesizes a coherent set of images from a single text prompt, enforcing diverse consistency constraints (e.g., style, layout, content) across the set—thereby overcoming limitations of existing models, which support only single-domain or single-consistency generation. To enable systematic evaluation, we construct T2IS-Bench, the first fine-grained benchmark comprising 596 diverse prompts across 26 subcategories, along with T2IS-Eval, a comprehensive evaluation framework for quantitative, multi-dimensional consistency assessment. Methodologically, we propose AutoT2IS, a training-free framework leveraging in-context learning in pretrained diffusion Transformers to jointly model inter-image consistency without architectural modification. Extensive experiments on T2IS-Bench demonstrate that AutoT2IS significantly outperforms both general-purpose and task-specific baselines, exhibits strong cross-category generalization, and enables several underexplored real-world applications.

📝 Abstract
Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistency-focused methods typically target a specific domain and specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce T2IS-Bench with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose T2IS-Eval, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose AutoT2IS, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements, satisfying both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency requirements challenge all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also enables numerous underexplored real-world applications, confirming its substantial practical value. Visit our project at https://chengyou-jia.github.io/T2IS-Home.
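The abstract describes T2IS-Eval as turning a user instruction into multiple assessment criteria and scoring how well a generated set fulfills each one. The sketch below illustrates that general idea only; the `Criterion` type, the weighting scheme, and the placeholder `scorer` callback are hypothetical stand-ins (the paper's framework uses its own derived criteria and learned evaluators), not the authors' implementation.

```python
# Hypothetical sketch of criteria-weighted set evaluation in the spirit of
# T2IS-Eval. A real system would derive criteria from the instruction and
# score images with a vision-language evaluator; here `scorer` is a stub.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    name: str       # e.g. "style consistency" or "identity consistency"
    weight: float   # relative importance of this criterion

def evaluate_set(images: List[str],
                 criteria: List[Criterion],
                 scorer: Callable[[str, Criterion], float]) -> float:
    """Weighted average of per-criterion scores, each averaged over the set."""
    total_weight = sum(c.weight for c in criteria)
    score = 0.0
    for c in criteria:
        per_image = [scorer(img, c) for img in images]
        score += c.weight * sum(per_image) / len(per_image)
    return score / total_weight

# Toy scorer that returns a constant, just to exercise the aggregation.
criteria = [Criterion("style", 2.0), Criterion("identity", 1.0)]
print(evaluate_set(["img_a", "img_b"], criteria, lambda img, c: 1.0))  # 1.0
```

A per-criterion breakdown (rather than one scalar) would be the natural next step for diagnosing which consistency dimension a method fails on.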
Problem

Research questions and friction points this paper is trying to address.

Generating coherent image sets with diverse consistency requirements
Overcoming limitations of domain-specific consistency methods
Assessing and fulfilling multifaceted consistency in image sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

T2IS-Bench for diverse instruction coverage
T2IS-Eval framework for multifaceted assessment
AutoT2IS leverages pretrained Diffusion Transformers