ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) benchmarks rely on static prompts, introducing evaluation bias: models are sensitive to prompt phrasing, so these benchmarks systematically underestimate true generative capability. Method: We propose a multimodal iterative prompt optimization framework that uses vision-language model (VLM) feedback to automatically refine prompts, improving fidelity on fine-grained concepts such as spatial relations and shape. The resulting prompt optimization pipeline is model-agnostic and parameter-free. Contribution/Results: Optimized prompts significantly improve compositional generation performance (+12.7% on COMPOSER). The pipeline achieves >83% cross-model transfer success across heterogeneous architectures, including Stable Diffusion, SDXL, and DALL·E 3, revealing that mainstream benchmarks underestimate model capability by an average of 19.4%. This work shifts T2I evaluation from "prompt-centric" to "capability-centric" assessment, establishing a new paradigm for fair, robust, and generalizable model comparison.
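The summary describes the pipeline only at a high level: generate an image, score it with a VLM, rewrite the prompt based on the feedback, and repeat. A minimal sketch of such a loop is below; all names are hypothetical, and the `generate`/`score`/`rewrite` stand-ins are toy stubs (a real run would call a T2I model and a VLM judge), so this illustrates the control flow rather than the paper's actual implementation.

```python
def optimize_prompt(prompt, generate, score, rewrite, max_iters=5, target=0.9):
    """Greedy iterative refinement: keep a candidate prompt only if the
    scored generation improves; stop early once the target is reached."""
    best_prompt = prompt
    best_score = score(generate(best_prompt), best_prompt)
    for _ in range(max_iters):
        if best_score >= target:
            break
        candidate = rewrite(best_prompt, best_score)        # feedback-guided rewrite
        cand_score = score(generate(candidate), candidate)  # VLM-style scoring
        if cand_score > best_score:
            best_prompt, best_score = candidate, cand_score
    return best_prompt, best_score


# Toy stand-ins (hypothetical): the "image" is just the prompt text, the
# "VLM" scores the fraction of required concepts present, and the rewrite
# step appends the first concept flagged as missing.
required = ["red cube", "left of", "blue sphere"]
generate = lambda p: p
score = lambda img, p: sum(c in img for c in required) / len(required)

def rewrite(prompt, current_score):
    """Append the first concept the scorer found missing, if any."""
    missing = [c for c in required if c not in prompt]
    return prompt + ", " + missing[0] if missing else prompt

best, final_score = optimize_prompt("a red cube and a blue sphere",
                                    generate, score, rewrite)
```

Because the loop only accepts rewrites that improve the score, it never regresses, which is one simple way to make such a pipeline parameter-free and model-agnostic: everything model-specific lives inside the `generate` and `score` callables.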

📝 Abstract
Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts -- such as spatial relationships and shapes -- benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.
Problem

Research questions and friction points this paper is trying to address.

Current T2I benchmarks use rigid, fixed prompts, causing unfair model comparisons.
Prompt sensitivity hides true generative capability that optimized prompts can reveal.
Existing benchmarks systematically underestimate performance on spatial-relationship and shape concepts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative prompt optimization for fair benchmarking
Multimodal pipeline with vision-language model feedback
Cross-model transferability of optimized prompts