🤖 AI Summary
Multi-subject customization faces two key challenges: multi-subject training data is scarce, and attributes become entangled across subjects. To address both, we propose MUSAR, a method that achieves robust multi-subject generation while training only on single-subject data. The approach has two components: (1) debiased diptych learning, which builds diptych training pairs from single-subject images and corrects the distribution bias that this construction introduces by using static attention routing together with a dual-branch LoRA; and (2) dynamic attention routing, which adaptively establishes bijective mappings between generated images and conditional subjects to decouple cross-subject attributes. Experiments show that the method outperforms existing approaches, including those trained on multi-subject datasets, in image quality, subject consistency, and interaction naturalness, and that it continues to generalize as the number of reference subjects grows. This makes multi-subject customization more practical and scalable, since no multi-subject training exemplars are required.
📝 Abstract
Current multi-subject customization approaches face two critical challenges: the difficulty of acquiring diverse multi-subject training data, and attribute entanglement across different subjects. To bridge these gaps, we propose MUSAR, a simple yet effective framework that achieves robust multi-subject customization while requiring only single-subject training data. First, to break the data limitation, we introduce debiased diptych learning. It constructs diptych training pairs from single-subject images to facilitate multi-subject learning, while actively correcting the distribution bias introduced by diptych construction via static attention routing and dual-branch LoRA. Second, to eliminate cross-subject entanglement, we introduce a dynamic attention routing mechanism, which adaptively establishes bijective mappings between generated images and conditional subjects. This design not only decouples multi-subject representations but also maintains scalable generalization performance as the number of reference subjects increases. Comprehensive experiments demonstrate that MUSAR outperforms existing methods, even those trained on multi-subject datasets, in image quality, subject consistency, and interaction naturalness, despite requiring only a single-subject dataset.
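To make the diptych idea and static attention routing more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a diptych is two equally sized single-subject images placed side by side, and that static routing lets each panel's tokens attend only to the condition tokens of its own subject. The function names `make_diptych` and `static_routing_mask`, the row-major token layout, and the equal token count per condition image are all hypothetical choices made for illustration.

```python
import numpy as np


def make_diptych(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Place two H x W x C single-subject images side by side (assumed layout)."""
    if left.shape != right.shape:
        raise ValueError("both panels must share the same shape")
    return np.concatenate([left, right], axis=1)


def static_routing_mask(h: int, w_panel: int, n_cond_tokens: int) -> np.ndarray:
    """Boolean attention mask of shape (h * 2 * w_panel, 2 * n_cond_tokens).

    Entry [q, k] is True when diptych token q may attend to condition token k.
    Diptych tokens are assumed to be flattened row by row; condition tokens are
    assumed to be ordered as [subject 0 tokens, subject 1 tokens].
    """
    # Panel index for each column of the diptych: 0 = left, 1 = right.
    panel_of_col = (np.arange(2 * w_panel) >= w_panel).astype(int)
    # Repeat the column pattern for every row to get one panel index per token.
    panel_of_query = np.tile(panel_of_col, h)
    # Subject index for each condition token.
    subject_of_key = np.repeat(np.arange(2), n_cond_tokens)
    # A query may attend to a key only when panel and subject indices match.
    return panel_of_query[:, None] == subject_of_key[None, :]


if __name__ == "__main__":
    left = np.zeros((2, 2, 3), dtype=np.uint8)
    right = np.ones((2, 2, 3), dtype=np.uint8)
    print(make_diptych(left, right).shape)
    print(static_routing_mask(h=2, w_panel=2, n_cond_tokens=3).astype(int))
```

A mask like this would typically be applied by setting the disallowed attention logits to a large negative value before the softmax. Dynamic attention routing, as described in the abstract, replaces this fixed panel-to-subject assignment with one chosen adaptively; its exact rule is not specified here, so it is not sketched.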