🤖 AI Summary
To address low personalization accuracy, heavy reliance on explicit user intervention, and the limited token capacity of text encoders in text-to-image (T2I) diffusion models, this paper proposes DrUM, a lightweight Transformer adapter that operates in the latent space and conditions generation on structured user-profile embeddings without fine-tuning the base model. DrUM is the first method to integrate structured user profiles directly into the conditioning mechanism of diffusion models; it is compatible with open-source text encoders and dynamically injects user signals in the latent space, substantially improving personalized representation. Experiments show that DrUM is plug-and-play on mainstream T2I models (e.g., Stable Diffusion), generates high-fidelity, semantically consistent personalized images from minimal user data, and outperforms existing adapter-based approaches across multiple benchmarks.
📝 Abstract
Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning.
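The condition-level modeling described above can be sketched as a small adapter that fuses user-profile embeddings into the text encoder's condition embeddings before they reach the frozen diffusion model. The sketch below is a minimal, hypothetical illustration (not the paper's actual architecture): the class name `ProfileAdapter`, the single-head cross-attention fusion, and all dimensions are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ProfileAdapter:
    """Hypothetical sketch of condition-level personalization:
    prompt condition embeddings attend to user-profile embeddings
    via single-head cross-attention, and the result is injected
    residually. Only the adapter holds trainable weights; the
    text encoder and diffusion backbone stay frozen."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(dim)
        self.Wq = rng.normal(0, s, (dim, dim))  # queries from prompt condition
        self.Wk = rng.normal(0, s, (dim, dim))  # keys from user profile
        self.Wv = rng.normal(0, s, (dim, dim))  # values from user profile

    def __call__(self, cond, profile):
        # cond: (seq_len, dim) embeddings from the text encoder
        # profile: (n_prefs, dim) embeddings of user preferences
        q = cond @ self.Wq
        k = profile @ self.Wk
        v = profile @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(cond.shape[-1]))
        # Residual injection preserves the original prompt semantics
        # while biasing generation toward the user's preferences.
        return cond + attn @ v
```

Because the adapter only rewrites the conditioning tensor, it sidesteps the prompt-token budget: user preferences never consume input tokens, which is one way to read the paper's claim about avoiding token-capacity limits.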