ZIPP:Zero-shot Image Personalization from Personas

📅 2026-06-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of adapting text-to-image diffusion models to individual aesthetic preferences in cold-start scenarios, where no user history or model fine-tuning is available. The authors propose a zero-shot personalization method that leverages large language models to rewrite prompts by incorporating natural language descriptions of user personas, thereby guiding image generation toward user-specific preferences. Key contributions include the first fully zero-shot approach to personalized image generation, a scalable graph attention–based technique for extracting and verbalizing user personas from social interactions, and ZIPBench—the first benchmark for evaluating zero-shot personalization. Experiments demonstrate consistent performance gains, with average improvements of 13–20% across four benchmarks, human preference win rates of 79% against generic generation and 58–65% against fine-tuned baselines, and significantly reduced preference distribution divergence (CMMD: 0.16) and subgroup bias.

📝 Abstract

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

Problem

Research questions and friction points this paper is trying to address.

image personalization

zero-shot learning

text-to-image generation

user preference modeling

cold-start problem

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot personalization

persona-based generation

graph attention network