🤖 AI Summary
To address the significant degradation in zero-shot performance of vision-language object detectors when extended to novel visual modalities, such as infrared and depth imagery, this paper proposes ModPrompt, a modality-adaptive visual prompting method that adapts detectors without full fine-tuning. Its core contribution is an encoder-decoder visual prompting architecture designed for detection tasks, coupled with an inference-friendly modality prompt decoupled residual that overcomes the representational limits of the conventional linear prompts used for classification. ModPrompt preserves the original model's zero-shot generalization capability while integrating with mainstream frameworks, including YOLO-World and Grounding DINO. Extensive experiments on the LLVIP and FLIR (infrared) and NYUv2 (depth) benchmarks demonstrate performance on par with full fine-tuning, validating both its effectiveness and its cross-modal generalizability.
📝 Abstract
The zero-shot performance of object detectors degrades when they are tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors such as YOLO-World and Grounding DINO have shown promising zero-shot capabilities; however, they have not yet been adapted to other visual modalities. Traditional fine-tuning approaches compromise the detectors' zero-shot capabilities, and the visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to every image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy for adapting vision-language detectors to new modalities without degrading zero-shot performance. In particular, we propose an encoder-decoder visual prompt strategy, further enhanced by an inference-friendly modality prompt decoupled residual, which facilitates a more robust adaptation. We benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets, achieving performance comparable to full fine-tuning while preserving the models' zero-shot capability. Code available at: https://github.com/heitorrapela/ModPrompt.
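To make the contrast with linear prompting concrete, the general idea can be sketched as follows. This is a minimal toy illustration of input-conditioned encoder-decoder prompting with a residual connection, not the paper's actual architecture: all class and parameter names (`EncoderDecoderPrompt`, `bottleneck`, `alpha`) are hypothetical, and images are flattened vectors rather than feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

class EncoderDecoderPrompt:
    """Toy input-conditioned visual prompt (hypothetical sketch).

    Unlike a single linear prompt translation applied identically to
    every image, the prompt here depends on the input: an encoder
    compresses the image, a decoder produces a same-shaped prompt, and
    the prompt is added back through a scaled residual so the frozen
    detector still sees something close to the original image.
    """

    def __init__(self, dim, bottleneck, alpha=0.1):
        # In practice these would be trained while the detector stays frozen.
        self.W_enc = rng.standard_normal((dim, bottleneck)) * 0.01
        self.W_dec = rng.standard_normal((bottleneck, dim)) * 0.01
        self.alpha = alpha  # residual scale; alpha=0 recovers the raw input

    def __call__(self, x):
        z = np.tanh(x @ self.W_enc)      # encode: image-conditioned code
        prompt = z @ self.W_dec          # decode: same-shaped visual prompt
        return x + self.alpha * prompt   # decoupled residual add

dim = 3 * 8 * 8  # a toy 8x8 RGB image, flattened
prompt_net = EncoderDecoderPrompt(dim, bottleneck=16)
x = rng.standard_normal(dim)            # stand-in for an infrared/depth image
x_adapted = prompt_net(x)               # what the frozen detector would consume
assert x_adapted.shape == x.shape
```

The residual form is what keeps adaptation "inference-friendly" in spirit: the adapted input stays a small perturbation of the original, so the zero-shot behavior of the frozen detector is not overwritten.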