🤖 AI Summary
Cross-domain few-shot object detection (CD-FSOD) faces the dual challenges of domain shift and severe label scarcity; existing data-augmentation and generation methods struggle to simultaneously ensure visual realism, category fidelity, and alignment with target-domain backgrounds. This paper proposes the first training-free, retrieval-generation collaborative framework, which achieves both semantic (object category) and stylistic (domain characteristics) alignment via foreground-background disentanglement and recomposition. Specifically, it first decomposes the input image and retrieves domain-adapted backgrounds across domains; it then employs a conditional diffusion model to synthesize a domain-consistent background; finally, it fuses the original foreground with the generated background. The method significantly outperforms established baselines on multiple few-shot detection benchmarks, including CD-FSOD, remote sensing object detection, and camouflaged object detection, setting new state-of-the-art results.
📝 Abstract
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or to produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated, domain-aligned background to form the final image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Code will be released upon acceptance.
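The three-stage pipeline described above can be sketched as the following control flow. This is a minimal, illustrative sketch only: every function body here is a toy placeholder (the actual components named in the abstract, such as the segmenter, the retriever, and the conditional diffusion model, are not specified in this text), and all data structures and scoring rules are assumptions made for illustration.

```python
def decompose(image):
    """Split an image into foreground and background (toy placeholder
    for the foreground-background decomposition step)."""
    return image["foreground"], image["background"]


def retrieve_backgrounds(background, pool, k=2):
    """Stage 1: domain-aware background retrieval. The real method ranks
    images by semantic and stylistic similarity; here we use a toy score
    over shared tags plus a domain-match bonus (an assumption)."""
    def score(cand):
        semantic = len(set(background["tags"]) & set(cand["tags"]))
        stylistic = 1 if cand["domain"] == background["domain"] else 0
        return semantic + stylistic
    return sorted(pool, key=score, reverse=True)[:k]


def generate_background(original_bg, retrieved):
    """Stage 2: stand-in for the conditional diffusion model, which is
    conditioned on both the original and the retrieved contexts."""
    return {"domain": original_bg["domain"],
            "conditioned_on": [r["id"] for r in retrieved]}


def compose(foreground, new_background):
    """Stage 3: paste the preserved foreground onto the newly generated,
    domain-aligned background."""
    return {"foreground": foreground, "background": new_background}


def domain_rag(image, pool):
    """End-to-end pipeline: decompose -> retrieve -> generate -> compose.
    Training-free: no parameters are updated anywhere in this flow."""
    fg, bg = decompose(image)
    retrieved = retrieve_backgrounds(bg, pool)
    new_bg = generate_background(bg, retrieved)
    return compose(fg, new_bg)


if __name__ == "__main__":
    image = {"foreground": "ship",
             "background": {"tags": ["sea", "harbor"], "domain": "sonar"}}
    pool = [
        {"id": "a", "tags": ["sea"], "domain": "sonar"},
        {"id": "b", "tags": ["forest"], "domain": "rgb"},
        {"id": "c", "tags": ["sea", "harbor"], "domain": "sonar"},
    ]
    sample = domain_rag(image, pool)
    print(sample["foreground"], sample["background"]["conditioned_on"])
```

Note how the original foreground is carried through unchanged while only the background is replaced, which is what lets the method preserve category fidelity while adapting style.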