UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing approaches struggle to accurately disentangle and localize specific concepts in multi-object images; they often rely on segmentation supervision and show limited compositional generalization. This work proposes UniVerse, a framework that introduces a unified modulation mechanism within a diffusion transformer and is described as the first method to disentangle and personalize multiple concepts without segmentation masks. UniVerse supports both composing and decomposing concepts, which improves fine-grained object localization and representation in complex scenes. Experiments on multiple benchmarks show that UniVerse outperforms current state-of-the-art methods in both localization accuracy and visual fidelity.
📝 Abstract
Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.
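The abstract does not say how the unified modulation mechanism is implemented. As background only, diffusion transformers commonly condition image tokens through adaptive layer-norm modulation, where a conditioning vector is mapped to a per-feature shift and scale. The NumPy sketch below illustrates that generic pattern under stated assumptions: the `concept_modulation` map, the weight shapes, and all dimensions are hypothetical and are not taken from the paper, so this should not be read as UniVerse's actual method.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension (no learned affine).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def modulate(x, shift, scale):
    # Generic DiT-style modulation: scale and shift the normalized tokens.
    return layer_norm(x) * (1.0 + scale) + shift


def concept_modulation(concept_emb, w, b):
    # Hypothetical linear map from one concept embedding to (shift, scale).
    out = concept_emb @ w + b
    shift, scale = np.split(out, 2, axis=-1)
    return shift, scale


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))        # 4 image tokens, 8 features each
w = rng.normal(size=(16, 16)) * 0.1     # assumed projection: 16 -> 2 * 8
b = np.zeros(16)

# Two assumed concept embeddings, each yielding its own modulation.
concept_a = rng.normal(size=(16,))
concept_b = rng.normal(size=(16,))
out_a = modulate(tokens, *concept_modulation(concept_a, w, b))
out_b = modulate(tokens, *concept_modulation(concept_b, w, b))
```

In this toy setup the same image tokens are modulated differently per concept, which is one plausible way to read "concept-specific representations"; how UniVerse composes or decomposes such signals is not described in the abstract.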
Problem

Research questions and friction points this paper is trying to address.

personalized visual understanding
multi-concept disentanglement
segmentation-free localization
compositional generalization
concept extraction
Innovation

Methods, ideas, or system contributions that make the work stand out.

segmentation-free
disentangled representation
multi-concept personalization
diffusion transformers
unified modulation