VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

📅 2025-10-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Visual foundation models can underperform under distribution shift and label scarcity, and continued self-supervised learning, though common for generative language models, has not proven effective for vision-centric encoders. VESSA is an annotation-free fine-tuning framework that adapts such models to a new domain using only short multi-view object-centric videos. It combines self-distillation with multi-view instance alignment across frames of the same video to exploit temporal and viewpoint invariances, and relies on carefully tuned prediction heads and LoRA-based parameter-efficient fine-tuning to avoid forgetting pretrained knowledge. Experiments with three vision foundation models on two datasets show consistent improvements on downstream classification over the base models and previous adaptation methods.

📝 Abstract

Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques; otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions without the need for annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
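
To make the self-distillation idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a DINO-style setup, which the abstract does not confirm: a student and a teacher each see a different frame of the same object video, the student is trained to match the teacher's sharpened output distribution, and the teacher tracks the student through an exponential moving average. The function names, temperatures, momentum, and frame-gap parameter are all hypothetical.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax computed without taking log of tiny values."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions.

    The teacher output is treated as a fixed target (no gradient would flow
    through it in a real training loop). Temperatures are assumed values.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


def sample_frame_pair(num_frames, max_gap, rng):
    """Pick two distinct frame indices at most max_gap apart (hypothetical sampler).

    Requires num_frames >= 2 and max_gap >= 1.
    """
    i = rng.randrange(num_frames)
    lo, hi = max(0, i - max_gap), min(num_frames - 1, i + max_gap)
    j = rng.choice([k for k in range(lo, hi + 1) if k != i])
    return i, j


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move each teacher parameter a small step toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a full training loop, the two sampled frames would be encoded by the teacher and the student respectively, the loss above would be minimised with respect to the student only, and `ema_update` would be applied after each step.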

Problem

Research questions and friction points this paper is trying to address.

Adapts vision models to new domains without requiring labeled data

Uses object-centric videos for self-supervised domain adaptation

Addresses performance degradation under distribution shifts and scarce labels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised fine-tuning for vision foundation models

Leverages multi-view object-centric videos without annotations

Uses self-distillation paradigm with parameter-efficient adaptation techniques
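
As an illustration of parameter-efficient adaptation, the sketch below shows a low-rank adapter (LoRA) on a single linear layer in plain Python. The AI Summary names LoRA, but the rank, scaling, and which layers are adapted are not given on this page, so the values and the `LoRALinear` class here are assumptions for illustration only. The pretrained weight stays frozen, and only the two small matrices `A` and `B` would be trained.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Linear layer y = W x + (alpha / rank) * B (A x) with W frozen.

    Hypothetical illustration class, not taken from the VESSA code base.
    """

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        out_dim, in_dim = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A starts with small random values, B starts at zero, so the adapter
        # contributes nothing before training and the pretrained output is kept.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(0)
pretrained = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
layer = LoRALinear(pretrained, rank=1, alpha=1.0, rng=rng)
x = [1.0, -2.0, 0.5, 3.0]
```

Because `B` is initialised to zero, `layer.forward(x)` equals the frozen layer's output at the start of adaptation. This is one way such a method can limit drift away from pretrained knowledge while training far fewer parameters than the full weight matrix.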

🔎 Similar Papers

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

2024-07-28 · arXiv.org · Citations: 0