🤖 AI Summary
Language models (LMs) inherently memorize training data, rendering them vulnerable to extraction attacks; however, existing evaluations—often confined to single models and fixed prompts—severely underestimate real-world threats. This work adopts an attacker’s perspective to systematically model multi-source collaborative extraction capabilities across three dimensions: model scale, training checkpoints, and prompt sensitivity. We propose an adversarial extraction framework integrating prompt perturbation, cross-model transfer, and checkpoint traversal, coupled with a multi-task verification mechanism for data provenance, copyright detection, and PII identification. Key findings include: (i) even small prompt perturbations, as well as targeting smaller models or earlier checkpoints, surface distinct memorized content, increasing extraction diversity; (ii) fusing results from multiple attack configurations doubles extraction success rates. The risk escalates significantly in both unmitigated and deduplication-defended settings, and the combined adversary consistently outperforms baselines on pretraining-data localization, copyright attribution, and privacy extraction tasks.
📝 Abstract
Language models are prone to memorizing parts of their training data, which makes them vulnerable to extraction attacks. Existing research often examines isolated setups--such as evaluating extraction risks from a single model or with a fixed prompt design. However, a real-world adversary could access models across various sizes and checkpoints, as well as exploit prompt sensitivity, resulting in a considerably larger attack surface than previously studied. In this paper, we revisit extraction attacks from an adversarial perspective, focusing on how to leverage the brittleness of language models and the multi-faceted access to the underlying data. We find significant churn in extraction trends, i.e., even unintuitive changes to the prompt, or targeting smaller models and earlier checkpoints, can extract distinct information. By combining information from multiple attacks, our adversary is able to increase the extraction risks by up to $2\times$. Furthermore, even with mitigation strategies like data deduplication, we find the same escalation of extraction risks against a real-world adversary. We conclude with a set of case studies, including detecting pre-training data, copyright violations, and extracting personally identifiable information, showing how our more realistic adversary can outperform existing adversaries in the literature.
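The core mechanism behind the "up to $2\times$" gain is that each attack configuration (prompt variant, model size, checkpoint) recovers a partly distinct set of memorized sequences, so the adversary can simply take the union of all results. The sketch below illustrates this fusion idea on toy data; the attack sets and the `fuse_extractions` helper are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch: fusing extraction results across attack
# configurations. Each set stands in for the training sequences
# that one (prompt, model, checkpoint) configuration recovered.

def fuse_extractions(per_attack_results):
    """Return the union of sequences recovered by individual attacks."""
    fused = set()
    for extracted in per_attack_results:
        fused |= set(extracted)
    return fused

# Toy example: three configurations, each extracting partly
# distinct memorized sequences.
attacks = [
    {"seq_a", "seq_b"},  # baseline prompt, largest model
    {"seq_b", "seq_c"},  # perturbed prompt
    {"seq_d"},           # earlier training checkpoint
]

fused = fuse_extractions(attacks)
best_single = max(len(a) for a in attacks)
print(len(fused), best_single)  # prints "4 2": fusion doubles coverage here
```

Because the sets only partially overlap, the fused adversary recovers twice as many sequences as the best single configuration in this toy case, mirroring the churn the abstract describes.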