MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

📅 2026-06-06
🤖 AI Summary
This work addresses the ethical and copyright concerns raised by opaque pretraining data in black-box large language models. It proposes a pretraining data detection method that requires no access to internal model parameters or output probability distributions. Inspired by masked language modeling, the approach identifies high-specificity tokens in candidate documents, masks them, and measures via API queries how often the model predicts the masked content correctly. It then compares the prediction hit rate on the candidate corpus with the hit rate on a reference non-member corpus and applies statistical significance testing, which enables corpus-level attribution. Under this strict black-box setting, the method shows clear and consistent hit-rate differences across three datasets for both open-source and closed-source models. Its performance is comparable to existing detection methods, which typically rely on access to model probability distributions, making it a practical tool for model auditing and copyright verification.
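The masking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how "high-specificity" tokens are chosen, so the sketch assumes an inverse-document-frequency (IDF) score computed over a reference corpus. The names `MASK`, `tokenize`, `idf_table`, and `mask_specific_tokens` are hypothetical.

```python
import math
import re
from collections import Counter

MASK = "[MASK]"  # hypothetical placeholder string shown to the model


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def idf_table(reference_docs):
    """Build a smoothed IDF table; assumption: rare tokens count as 'specific'."""
    n = len(reference_docs)
    doc_freq = Counter()
    for doc in reference_docs:
        doc_freq.update({t.lower() for t in tokenize(doc)})
    table = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    default = math.log(1 + n) + 1.0  # score for tokens never seen in the reference
    return table, default


def mask_specific_tokens(text, idf, default_idf, k=1):
    """Mask the k highest-IDF alphanumeric tokens; return masked text and answers."""
    tokens = tokenize(text)
    scored = [
        (idf.get(t.lower(), default_idf), i)
        for i, t in enumerate(tokens)
        if t.isalnum()
    ]
    # Highest score first; ties broken by earliest position.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    positions = sorted(i for _, i in top)
    answers = [tokens[i] for i in positions]
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return " ".join(masked), answers


reference = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the bird sat on the fence",
]
idf, default_idf = idf_table(reference)
masked_text, answers = mask_specific_tokens("the zebra sat on the mat", idf, default_idf)
print(masked_text, answers)
```

In this toy run the unseen word "zebra" receives the highest score and is the token that gets masked. A real setup would likely use a much larger reference corpus and the tokenizer appropriate to the target model.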
📝 Abstract
Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.
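The corpus-level comparison described in the abstract can be illustrated with a short sketch. The abstract does not name the statistical test, so this example assumes a one-sided two-proportion z-test as a stand-in. The `predict` callable stands for whatever API wrapper returns the model's guess for a masked text, and all function names here are hypothetical.

```python
import math


def hit_rate_counts(items, predict):
    """Count exact-match hits over (masked_text, answer) pairs.

    `predict` is a hypothetical callable wrapping the LLM API query.
    """
    hits = 0
    for masked_text, answer in items:
        guess = predict(masked_text)
        hits += int(guess.strip().lower() == answer.strip().lower())
    return hits, len(items)


def one_sided_two_proportion_z(h_cand, n_cand, h_ref, n_ref):
    """Test whether the candidate hit rate exceeds the reference hit rate.

    Assumption: a pooled two-proportion z-test; the paper may use another test.
    Returns (z, p_value) for the alternative 'candidate rate > reference rate'.
    """
    p_cand, p_ref = h_cand / n_cand, h_ref / n_ref
    pooled = (h_cand + h_ref) / (n_cand + n_ref)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cand + 1 / n_ref))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence either way
    z = (p_cand - p_ref) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value


def looks_like_member(cand_counts, ref_counts, alpha=0.05):
    """Flag the candidate corpus when its hit rate is significantly higher."""
    _, p_value = one_sided_two_proportion_z(*cand_counts, *ref_counts)
    return p_value < alpha


# Illustrative counts only, not results from the paper.
z, p = one_sided_two_proportion_z(60, 100, 30, 100)
print(round(z, 2), p < 0.001, looks_like_member((60, 100), (30, 100)))
```

The decision is made per corpus rather than per document, which is why the sketch aggregates hits over all masked items before testing. The significance level `alpha` is an illustrative default, not a value taken from the paper.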
Problem

Research questions and friction points this paper is trying to address.

pretraining data detection
black-box LLMs
data transparency
model auditing
copyright verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

pretraining data detection
black-box LLM
masked language modeling
model auditing
data copyright verification