🤖 AI Summary
To address the significant degradation in zero-shot performance of vision-language object detectors when extended to novel visual modalities, such as infrared and depth imagery, this paper proposes ModPrompt, a modality-adaptive visual prompting method that adapts detectors without full fine-tuning. Its core contribution is an encoder-decoder visual prompting architecture designed for detection tasks, coupled with an inference-friendly modality prompt decoupled residual that overcomes the representational limits of the conventional linear prompts used for classification. ModPrompt preserves the original model's zero-shot generalization capability while integrating with mainstream frameworks, including YOLO-World and Grounding DINO. Extensive experiments on the LLVIP and FLIR (infrared) and NYUv2 (depth) benchmarks demonstrate performance on par with full fine-tuning, validating both its effectiveness and its cross-modal generalizability.
📝 Abstract
The zero-shot performance of object detectors degrades when they are tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors such as YOLO-World and Grounding DINO have shown promising zero-shot capabilities; however, they have not yet been adapted to other visual modalities. Traditional fine-tuning approaches compromise the detectors' zero-shot capabilities, and the visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to every image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy for adapting vision-language detectors to new modalities without degrading zero-shot performance. In particular, we propose an encoder-decoder visual prompt strategy, further enhanced by an inference-friendly modality prompt decoupled residual, which facilitates a more robust adaptation. We benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets, achieving performance comparable to full fine-tuning while preserving the models' zero-shot capability. Code available at: https://github.com/heitorrapela/ModPrompt.
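To make the contrast with linear prompting concrete, the general idea can be sketched as follows. This is a minimal toy illustration of input-conditioned encoder-decoder prompting with a residual connection, not the paper's actual architecture: all class and parameter names (`EncoderDecoderPrompt`, `bottleneck`, `alpha`) are hypothetical, and images are flattened vectors rather than feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

class EncoderDecoderPrompt:
    """Toy input-conditioned visual prompt (hypothetical sketch).

    Unlike a single linear prompt translation applied identically to
    every image, the prompt here depends on the input: an encoder
    compresses the image, a decoder produces a same-shaped prompt, and
    the prompt is added back through a scaled residual so the frozen
    detector still sees something close to the original image.
    """

    def __init__(self, dim, bottleneck, alpha=0.1):
        # In practice these would be trained while the detector stays frozen.
        self.W_enc = rng.standard_normal((dim, bottleneck)) * 0.01
        self.W_dec = rng.standard_normal((bottleneck, dim)) * 0.01
        self.alpha = alpha  # residual scale; alpha=0 recovers the raw input

    def __call__(self, x):
        z = np.tanh(x @ self.W_enc)      # encode: image-conditioned code
        prompt = z @ self.W_dec          # decode: same-shaped visual prompt
        return x + self.alpha * prompt   # decoupled residual add

dim = 3 * 8 * 8  # a toy 8x8 RGB image, flattened
prompt_net = EncoderDecoderPrompt(dim, bottleneck=16)
x = rng.standard_normal(dim)            # stand-in for an infrared/depth image
x_adapted = prompt_net(x)               # what the frozen detector would consume
assert x_adapted.shape == x.shape
```

The residual form is what keeps adaptation "inference-friendly" in spirit: the adapted input stays a small perturbation of the original, so the zero-shot behavior of the frozen detector is not overwritten.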