🤖 AI Summary
Vision-language object detectors (VLODs) suffer substantial zero-shot performance degradation under domain shift. To address this, we propose a test-time adaptation (TTA) framework that operates without source-domain data or model fine-tuning and is compatible with mainstream VLOD architectures—including YOLO-World and Grounding DINO. Our method introduces two key innovations: (1) an IoU-weighted entropy minimization objective that prioritizes spatially coherent, high-quality region proposal clusters; and (2) an image-conditioned prompt selection and fusion mechanism that dynamically identifies and weights the most discriminative text prompts, thereby mitigating confirmation bias and interference from isolated predictions. Evaluated across diverse distribution shifts—including style transfer, autonomous driving scenes, low-light conditions, and multiple image corruptions—our approach consistently outperforms zero-shot baselines and existing TTA methods, achieving significant gains in cross-domain robustness and detection accuracy.
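The IoU-weighted entropy objective can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: we weight each proposal's prediction entropy by its total overlap with the other proposals, so isolated boxes contribute little to the adaptation loss.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def iou_weighted_entropy(boxes, probs):
    """Entropy of each proposal's class distribution, weighted by the
    proposal's overlap mass with other proposals (a proxy for cluster
    coherence). Isolated boxes get ~zero weight, which is one way to
    reduce confirmation bias from spurious single detections."""
    ious = iou_matrix(boxes)
    np.fill_diagonal(ious, 0.0)            # ignore self-overlap
    weights = ious.sum(axis=1)             # overlap mass per proposal
    weights = weights / np.maximum(weights.sum(), 1e-9)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=1)
    return float((weights * entropy).sum())
```

In an actual TTA loop this scalar would be minimized with respect to a small set of adapted parameters; here it only illustrates how overlap-based weighting suppresses isolated boxes.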
📝 Abstract
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, we propose an IoU-weighted entropy objective that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, we introduce image-conditioned prompt selection, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts -- including stylized domains, driving scenes, low-light conditions, and common corruptions -- shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
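The prompt selection-and-fusion step could look roughly like the sketch below. This is our own simplified illustration, not the released code: we assume per-prompt image-level compatibility scores (e.g., CLIP-style image-text similarities) and per-prompt detector logits, keep the top-k prompts, and fuse them with softmax weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_and_fuse_prompts(image_scores, prompt_logits, k=2):
    """Rank prompts by image-level compatibility and fuse the top-k.

    image_scores:  (P,)       image-prompt compatibility scores
    prompt_logits: (P, N, C)  per-proposal class logits under each prompt
    Returns the fused (N, C) logits and the indices of selected prompts.
    """
    top = np.argsort(image_scores)[::-1][:k]   # most compatible prompts
    w = softmax(image_scores[top])             # weights within the top-k
    fused = np.tensordot(w, prompt_logits[top], axes=1)  # (N, C)
    return fused, top
```

The exact scoring function, k, and fusion rule in VLOD-TTA may differ; the sketch only shows the rank-select-fuse pattern the abstract describes.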