ZIPP: Zero-shot Image Personalization from Personas

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of adapting text-to-image diffusion models to individual aesthetic preferences in cold-start scenarios, where no user history is available and per-user fine-tuning is not possible. The authors propose a zero-shot personalization method in which a large language model rewrites each prompt using a natural-language description of the user's persona, steering image generation toward that user's preferences. Key contributions include the first fully zero-shot approach to personalized image generation, a scalable graph attention-based technique for extracting and verbalizing user personas from social interactions, and ZIPBench, the first benchmark for evaluating zero-shot personalization. Experiments show consistent gains: average improvements of 13–20% across four benchmarks, human preference win rates of 79% against generic generation and 58–65% against fine-tuned baselines, the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and substantially reduced subgroup bias.
📝 Abstract
Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic and context-dependent: a user who favors muted, nostalgic portraits may still prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, so they fail in cold-start settings and collapse context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13–20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58–65% over all fine-tuned baselines.
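The prompt-rewriting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the template wording, the function names (`build_rewrite_request`, `personalize_prompt`), and the `stub_llm` stand-in are all assumptions made for illustration. In practice, `llm` would be any callable that sends the request to a real language model and returns its reply, and the rewritten prompt would then be passed to a diffusion model.

```python
from typing import Callable

# Hypothetical instruction template; the paper's actual prompt is not given here.
REWRITE_TEMPLATE = (
    "You are rewriting a text-to-image prompt for a specific user.\n"
    "User persona: {persona}\n"
    "Original prompt: {prompt}\n"
    "Rewrite the prompt so the image matches this user's aesthetic "
    "preferences while keeping the original subject. "
    "Return only the rewritten prompt."
)


def build_rewrite_request(persona: str, prompt: str) -> str:
    """Combine the persona and the original prompt into one LLM request."""
    return REWRITE_TEMPLATE.format(persona=persona.strip(), prompt=prompt.strip())


def personalize_prompt(persona: str, prompt: str, llm: Callable[[str], str]) -> str:
    """Ask the LLM for a persona-conditioned rewrite of the prompt.

    No user history and no weight updates are involved: the persona text
    is the only user-specific input. Falls back to the original prompt
    if the LLM returns an empty reply.
    """
    rewritten = llm(build_rewrite_request(persona, prompt)).strip()
    return rewritten or prompt


def stub_llm(request: str) -> str:
    """Deterministic stand-in for a real LLM call (assumption for the demo)."""
    original = request.split("Original prompt: ", 1)[1].split("\n", 1)[0]
    return f"{original}, muted tones, soft film grain"


if __name__ == "__main__":
    persona = "Enjoys nostalgic film photography and quiet, muted color palettes."
    print(personalize_prompt(persona, "a portrait of a woman", stub_llm))
```

Passing the LLM as a callable keeps the sketch independent of any particular model provider, which matches the abstract's evaluation across many LLMs.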
Problem

Research questions and friction points this paper is trying to address.

image personalization
zero-shot learning
text-to-image generation
user preference modeling
cold-start problem
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot personalization
persona-based generation
graph attention network
diffusion models
LLM prompt rewriting