🤖 AI Summary
Diffusion models often generalize poorly when fine-tuned on small target domains, while test-time guidance methods incur high computational overhead and trade away sample diversity. To address these challenges, the authors propose DogFit, a domain-guided fine-tuning framework. Its core idea is to internalize test-time guidance into the fine-tuning stage: a domain-aware guidance offset is injected into the training loss, the guidance strength is encoded as an additional model input through a lightweight conditioning mechanism, and two scheduling strategies, late-start and cut-off, balance fidelity and diversity during training. The domain-aware design exploits the observation that the pretrained unconditional source model offers a stronger marginal estimate than the target model during fine-tuning, enabling controllable generation with a single forward pass. Experiments on DiT and SiT backbones across six target domains show that DogFit outperforms prior guidance methods in FID and FDDINOV2 while requiring up to 2× fewer sampling TFLOPS and no additional inference-time computation.
📝 Abstract
Transfer learning of diffusion models to smaller target domains is challenging, as naively fine-tuning the model often results in poor generalization. Test-time guidance methods help mitigate this by offering controllable improvements in image fidelity through a trade-off with sample diversity. However, this benefit comes at a high computational cost, typically requiring dual forward passes during sampling. We propose the Domain-guided Fine-tuning (DogFit) method, an effective guidance mechanism for diffusion transfer learning that maintains controllability without incurring additional computational overhead. DogFit injects a domain-aware guidance offset into the training loss, effectively internalizing the guided behavior during the fine-tuning process. The domain-aware design is motivated by our observation that during fine-tuning, the unconditional source model offers a stronger marginal estimate than the target model. To support efficient controllable fidelity-diversity trade-offs at inference, we encode the guidance strength value as an additional model input through a lightweight conditioning mechanism. We further investigate the optimal placement and timing of the guidance offset during training and propose two simple scheduling strategies, namely late-start and cut-off, which improve generation quality and training stability. Experiments on DiT and SiT backbones across six diverse target domains show that DogFit can outperform prior guidance methods in transfer learning in terms of FID and FDDINOV2 while requiring up to 2× fewer sampling TFLOPS.
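The abstract does not give the exact loss, but the described mechanism (a classifier-free-guidance-style offset built from the frozen unconditional source model and internalized into the fine-tuning target, gated by late-start and cut-off schedules) can be sketched roughly as below. All function names, the extrapolation form, and the schedule parameters are illustrative assumptions, not the paper's formulation:

```python
# Hypothetical sketch of a DogFit-style training target (assumed form, not the
# authors' exact loss). The fine-tuned model is regressed toward a guided
# prediction, so no second forward pass is needed at sampling time.

def guided_target(eps_target, eps_source, w):
    """CFG-style extrapolation: push the target-domain prediction away from
    the frozen source model's unconditional (marginal) prediction by strength
    w. The strength w is also assumed to be fed to the model as an extra
    conditioning input, enabling the fidelity-diversity trade-off at inference."""
    return [t + w * (t - s) for t, s in zip(eps_target, eps_source)]

def late_start(step, start_step):
    """Late-start schedule (assumed): apply the guidance offset only after an
    initial warm-up phase of `start_step` training steps."""
    return step >= start_step

def cut_off(t, t_max):
    """Cut-off schedule (assumed): disable the offset at high noise levels
    t > t_max, where guidance tends to be unstable."""
    return t <= t_max
```

With `w = 0` the target reduces to plain fine-tuning, which is consistent with the controllability the abstract describes: a single conditioned network covers the whole guidance-strength range at no extra sampling cost.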