Dataset Ownership Verification in Contrastive Pre-trained Models

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses dataset ownership verification for self-supervised contrastive learning models. We propose the first verifiable decision framework tailored to black-box backbone networks, enabling provenance attribution of specific pretraining datasets even in label-free settings. Our method leverages statistically significant disparities—quantified via p-values—in unary (single-instance) versus binary (positive/negative pair) instance relationships within the embedding space. The framework is model-agnostic and seamlessly integrates with mainstream self-supervised architectures, including SimCLR, BYOL, SimSiam, MoCo v3, and DINO. Extensive experiments demonstrate statistically significant verification (p < 0.05) across all evaluated models, consistently outperforming existing approaches in accuracy, robustness, and applicability. The implementation is publicly available.

📝 Abstract
High-quality open-source datasets, which necessitate substantial efforts for curation, have become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised models pre-trained by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insight that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MoCo v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a $p$-value markedly below $0.05$, surpassing all previous methodologies. Our code is available at https://github.com/xieyc99/DOV4CL.
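The verification idea can be illustrated with a minimal hypothesis-testing sketch. This is not the paper's exact statistic: it assumes we have already measured, for each probe sample from the protected dataset, a similarity score in the suspect backbone's embedding space (e.g. mean cosine similarity between augmented views), plus the same scores from independent "shadow" backbones never trained on the data. The scores below are synthetic stand-ins; a one-sided Welch t-test then decides whether the suspect's scores are significantly higher.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real measurements: per-sample mean cosine
# similarity between embeddings of augmented views of protected samples.
suspect_scores = rng.normal(0.85, 0.03, size=50)  # backbone trained ON the dataset
shadow_scores = rng.normal(0.60, 0.05, size=50)   # independent backbones (never saw it)

# One-sided Welch t-test. H0: the suspect backbone behaves like the
# independent models on the protected data. Rejecting H0 (p < 0.05)
# supports the claim that the dataset was used in pre-training.
t_stat, p_value = stats.ttest_ind(
    suspect_scores, shadow_scores, equal_var=False, alternative="greater"
)
dataset_was_used = p_value < 0.05
print(f"p-value = {p_value:.3g}, ownership verified: {dataset_was_used}")
```

In the actual method, the score would combine both unary (single-instance) and binary (positive/negative pair) relationships rather than a single similarity; the statistical decision step is structured the same way.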
Problem

Research questions and friction points this paper is trying to address.

Verify dataset ownership in unsupervised models
Tailored for contrastive learning pre-trained models
Identify unlabeled dataset usage in black-box models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning verification method
Unsupervised pre-trained models analysis
Dataset ownership empirical validation
👥 Authors
Yuechen Xie
Zhejiang University
Jie Song
Zhejiang University
Mengqi Xue
Zhejiang University, Hangzhou City University
Haofei Zhang
Zhejiang University
Xingen Wang
Zhejiang University, Bangsheng Technology Co., Ltd.
Bingde Hu
Zhejiang University, Bangsheng Technology Co., Ltd.
Genlang Chen
NingboTech University
Mingli Song
Zhejiang University