🤖 AI Summary
Vision-language object detectors (VLODs) suffer substantial zero-shot performance degradation under domain shift. To address this, we propose a test-time adaptation (TTA) framework that operates without source-domain data or model fine-tuning and is compatible with mainstream VLOD architectures—including YOLO-World and Grounding DINO. Our method introduces two key innovations: (1) an IoU-weighted entropy minimization objective that prioritizes spatially coherent, high-quality region proposal clusters; and (2) an image-conditioned prompt selection and fusion mechanism that dynamically identifies and weights the most discriminative text prompts, thereby mitigating confirmation bias and interference from isolated predictions. Evaluated across diverse distribution shifts—including style transfer, autonomous driving scenes, low-light conditions, and multiple image corruptions—our approach consistently outperforms zero-shot baselines and existing TTA methods, achieving significant gains in cross-domain robustness and detection accuracy.
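The IoU-weighted entropy objective can be sketched roughly as follows. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: we weight each proposal's prediction entropy by its total overlap with the other proposals, so isolated boxes contribute little to the adaptation loss.

```python
import numpy as np

def iou_matrix(boxes):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1e-9)

def iou_weighted_entropy(boxes, probs):
    """Entropy of each proposal's class distribution, weighted by the
    proposal's overlap mass with other proposals (a proxy for cluster
    coherence). Isolated boxes get ~zero weight, which is one way to
    reduce confirmation bias from spurious single detections."""
    ious = iou_matrix(boxes)
    np.fill_diagonal(ious, 0.0)            # ignore self-overlap
    weights = ious.sum(axis=1)             # overlap mass per proposal
    weights = weights / np.maximum(weights.sum(), 1e-9)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=1)
    return float((weights * entropy).sum())
```

In an actual TTA loop this scalar would be minimized with respect to a small set of adapted parameters; here it only illustrates how overlap-based weighting suppresses isolated boxes.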
📝 Abstract
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, we propose an IoU-weighted entropy objective that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, we introduce image-conditioned prompt selection, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts -- including stylized domains, driving scenes, low-light conditions, and common corruptions -- shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
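The prompt selection-and-fusion step could look roughly like the sketch below. This is our own simplified illustration, not the released code: we assume per-prompt image-level compatibility scores (e.g., CLIP-style image-text similarities) and per-prompt detector logits, keep the top-k prompts, and fuse them with softmax weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_and_fuse_prompts(image_scores, prompt_logits, k=2):
    """Rank prompts by image-level compatibility and fuse the top-k.

    image_scores:  (P,)       image-prompt compatibility scores
    prompt_logits: (P, N, C)  per-proposal class logits under each prompt
    Returns the fused (N, C) logits and the indices of selected prompts.
    """
    top = np.argsort(image_scores)[::-1][:k]   # most compatible prompts
    w = softmax(image_scores[top])             # weights within the top-k
    fused = np.tensordot(w, prompt_logits[top], axes=1)  # (N, C)
    return fused, top
```

The exact scoring function, k, and fusion rule in VLOD-TTA may differ; the sketch only shows the rank-select-fuse pattern the abstract describes.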