🤖 AI Summary
This study addresses the privacy risks posed by the potential leakage of personally identifiable information (PII) from language models fine-tuned with supervised learning. It presents the first systematic investigation of this issue, introducing a domain-specific PII dataset comprising multi-turn, user-centric question-answering dialogues in the medical and legal fields, along with a comprehensive PII classification and evaluation framework. Focusing on prefix-based attack scenarios, the work proposes COVA, a novel decoding algorithm that substantially outperforms existing PII extraction methods. Experimental results demonstrate that adversaries can efficiently reconstruct PII even with limited prior knowledge, and that leakage severity varies significantly across PII types. COVA consistently enhances reconstruction success rates across diverse experimental settings, underscoring its effectiveness and robustness.
📝 Abstract
Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric Q&A datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.