VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

📅 2025-10-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large Vision-Language Models (VLMs) suffer substantial performance degradation on cross-domain video tasks due to distribution shift, and existing fine-tuning approaches often induce catastrophic forgetting or adapt too weakly to the target domain. To address this, we propose Vision Contextualized Probing (VisCoP), a lightweight module of learnable visual probes that recalibrates target-domain visual representations without updating the frozen backbone parameters. The probes are trained end-to-end and evaluated across view, modality, and task transfers. Experiments on multiple video domain-transfer benchmarks show that our approach significantly outperforms state-of-the-art methods, achieving an average 5.2% improvement in target-domain accuracy while preserving nearly all source-domain inference capability, effectively balancing domain specificity and knowledge stability.

πŸ“ Abstract
Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.
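Neither the summary nor the abstract pins down how the probes attach to the encoder. Below is a minimal sketch of one plausible reading: the probes are a small set of trainable tokens prepended to the frozen encoder's token sequence, so the probe embeddings are the only new parameters. `ProbedVisionEncoder`, `num_probes`, and the stand-in backbone are hypothetical names for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProbedVisionEncoder(nn.Module):
    """Frozen token-level vision encoder plus a compact set of learnable probes."""

    def __init__(self, frozen_encoder: nn.Module, embed_dim: int, num_probes: int = 16):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # pretrained weights stay untouched
        # The probes are the only new parameters: (1, num_probes, embed_dim).
        self.probes = nn.Parameter(torch.randn(1, num_probes, embed_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, seq_len, embed_dim), e.g. ViT patch embeddings.
        batch = patch_tokens.size(0)
        tokens = torch.cat([self.probes.expand(batch, -1, -1), patch_tokens], dim=1)
        # Self-attention lets the probes interact with the visual tokens,
        # while gradients flow only into the probe embeddings.
        return self.encoder(tokens)

# Usage with a stand-in transformer in place of the real pretrained backbone.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = ProbedVisionEncoder(backbone, embed_dim=768, num_probes=16)
features = model(torch.randn(2, 196, 768))  # -> (2, 16 + 196, 768)
```

Under this reading, the probes can attend to the patch tokens at every layer, which is how they would contextualize domain-shifted features without touching any pretrained weight.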
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language models to novel domains with distribution shifts
Preventing catastrophic forgetting while learning domain-specific features
Enhancing cross-domain performance across view, modality, and task transfers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments the vision encoder with a compact set of learnable visual probes
Enables domain adaptation with minimal parameter modification (see the training sketch after this list)
Outperforms existing strategies across cross-view, cross-modal, and cross-task transfers
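To make the minimal-parameter claim concrete, here is a hypothetical training setup continuing the `ProbedVisionEncoder` sketch above: only the probe tensor receives gradients, so the optimizer touches a few thousand parameters rather than the full backbone.

```python
# Continuing the sketch above: collect only the trainable (probe) parameters.
probe_params = [p for p in model.parameters() if p.requires_grad]
num_new = sum(p.numel() for p in probe_params)
print(num_new)  # 16 probes * 768 dims = 12,288 parameters vs. millions in the backbone
optimizer = torch.optim.AdamW(probe_params, lr=1e-4)
```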
Dominick Reilly
University of North Carolina at Charlotte
video understanding, multimodal learning
Manish Kumar Govind
University of North Carolina at Charlotte
Le Xue
Salesforce AI Research
Srijan Das
University of North Carolina at Charlotte