🤖 AI Summary
Language models (LMs) inherently memorize training data, rendering them vulnerable to extraction attacks; however, existing evaluations—often confined to single models and fixed prompts—severely underestimate real-world threats. This work adopts an attacker’s perspective to systematically model multi-source collaborative extraction capabilities across three dimensions: model scale, training checkpoints, and prompt sensitivity. We propose an adversarial extraction framework integrating prompt perturbation, cross-model transfer, and checkpoint traversal, coupled with a multi-task verification mechanism for data provenance, copyright detection, and PII identification. Key findings include: (i) even small prompt perturbations, as well as targeting smaller models or earlier checkpoints, surface distinct memorized content, increasing extraction diversity; (ii) fusing results from multiple attack configurations doubles extraction success rates. The risk escalates significantly in both unmitigated and deduplication-defended settings, and the combined adversary consistently outperforms baselines on pretraining-data localization, copyright attribution, and privacy extraction tasks.
📝 Abstract
Language models are prone to memorizing parts of their training data, which makes them vulnerable to extraction attacks. Existing research often examines isolated setups--such as evaluating extraction risks from a single model or with a fixed prompt design. However, a real-world adversary could access models across various sizes and checkpoints, as well as exploit prompt sensitivity, resulting in a considerably larger attack surface than previously studied. In this paper, we revisit extraction attacks from an adversarial perspective, focusing on how to leverage the brittleness of language models and the multi-faceted access to the underlying data. We find significant churn in extraction trends, i.e., even unintuitive changes to the prompt, or targeting smaller models and earlier checkpoints, can extract distinct information. By combining information from multiple attacks, our adversary is able to increase the extraction risks by up to $2\times$. Furthermore, even with mitigation strategies like data deduplication, we find the same escalation of extraction risks against a real-world adversary. We conclude with a set of case studies, including detecting pre-training data, copyright violations, and extracting personally identifiable information, showing how our more realistic adversary can outperform existing adversaries in the literature.
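The core mechanism behind the "up to $2\times$" gain is that each attack configuration (prompt variant, model size, checkpoint) recovers a partly distinct set of memorized sequences, so the adversary can simply take the union of all results. The sketch below illustrates this fusion idea on toy data; the attack sets and the `fuse_extractions` helper are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch: fusing extraction results across attack
# configurations. Each set stands in for the training sequences
# that one (prompt, model, checkpoint) configuration recovered.

def fuse_extractions(per_attack_results):
    """Return the union of sequences recovered by individual attacks."""
    fused = set()
    for extracted in per_attack_results:
        fused |= set(extracted)
    return fused

# Toy example: three configurations, each extracting partly
# distinct memorized sequences.
attacks = [
    {"seq_a", "seq_b"},  # baseline prompt, largest model
    {"seq_b", "seq_c"},  # perturbed prompt
    {"seq_d"},           # earlier training checkpoint
]

fused = fuse_extractions(attacks)
best_single = max(len(a) for a in attacks)
print(len(fused), best_single)  # prints "4 2": fusion doubles coverage here
```

Because the sets only partially overlap, the fused adversary recovers twice as many sequences as the best single configuration in this toy case, mirroring the churn the abstract describes.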