🤖 AI Summary
This study addresses a limitation in how text-to-image models are supervised: their preference data carries a single holistic verdict per comparison and so fails to capture designers' fine-grained judgments along dimensions such as typography, layout, color harmony, and visual hierarchy. To bridge this gap, the authors introduce TASTE, a dataset in which ten professional designers rank outputs from four current text-to-image models on nine design criteria and flag hallucinated content per image. They propose a criterion-agnostic signal-test framework that compares Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform null distributions. Every TASTE criterion rejects the random-rater null, yet no pre-trained system in the benchmark (six open-weight VLM judges and three dedicated T2I scorers) exceeds 0.55 macro agreement with the 5-designer majority. In contrast, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling, which highlights the limited capacity of current vision-language models to capture nuanced design preferences.
📝 Abstract
Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal-test framework, using Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design above that on food and movie preferences and below that on photo-style image quality, or the reverse; either way it lies between the two, and every TASTE criterion rejects the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers (HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2), exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.
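To make the idea of an exact iid-uniform null concrete, the following is a minimal sketch, not the authors' implementation. It assumes p is the number of items each rater ranks and R is the number of raters (an interpretation of "p = 4 and R = 5" that the abstract does not spell out), and it enumerates every rater profile to get the exact probability that the pairwise majority relation contains a Condorcet cycle when all rankings are drawn independently and uniformly. The function names `pair_signs`, `kendall_tau`, `has_cycle`, and `exact_cycle_null` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, permutations, product


def pair_signs(ranking):
    """ranking[i] is the rank of item i (0 = best).
    Returns +1 for each pair (a, b), a < b, where a is ranked above b, else -1."""
    p = len(ranking)
    return tuple(1 if ranking[a] < ranking[b] else -1
                 for a, b in combinations(range(p), 2))


def kendall_tau(r1, r2):
    """Kendall's tau between two tie-free rankings: (concordant - discordant) / pairs."""
    s1, s2 = pair_signs(r1), pair_signs(r2)
    return Fraction(sum(x * y for x, y in zip(s1, s2)), len(s1))


def has_cycle(majority, p):
    """A tie-free majority tournament is transitive exactly when its
    win counts are 0, 1, ..., p-1; anything else contains a cycle."""
    wins = [0] * p
    for (a, b), s in zip(combinations(range(p), 2), majority):
        wins[a if s > 0 else b] += 1
    return sorted(wins) != list(range(p))


def exact_cycle_null(p, R):
    """Exact P(majority relation has a cycle) for R iid uniform rankings of p items.
    Assumption: R is odd, so pairwise majorities never tie."""
    if R % 2 == 0:
        raise ValueError("R must be odd so that pairwise majorities cannot tie")
    signs = [pair_signs(perm) for perm in permutations(range(p))]
    identity = signs[0]  # permutations() yields the identity ranking first
    # Relabeling items leaves cyclicity unchanged, so the first rater can be
    # fixed to the identity ranking without changing the probability.
    cyclic = total = 0
    for rest in product(signs, repeat=R - 1):
        majority = tuple(sum(col) for col in zip(identity, *rest))
        cyclic += has_cycle(majority, p)
        total += 1
    return Fraction(cyclic, total)


if __name__ == "__main__":
    # Classic check: 3 raters, 3 items gives 12 cyclic profiles out of 216.
    print(exact_cycle_null(3, 3))  # prints 1/18
```

Under the same assumptions, `exact_cycle_null(4, 5)` would correspond to the setting named in the abstract; it enumerates 24^4 = 331,776 profiles after the symmetry reduction. An observed cycle rate can then be compared against this null value, and the same enumeration pattern extends to mean pairwise Kendall's tau or majority probability by swapping the statistic computed inside the loop.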