🤖 AI Summary
Existing differentially private (DP) natural language processing (NLP) research largely overlooks the impact of dataset scale on the privacy–utility trade-off in DP text rewriting. Method: We systematically investigate this scale effect through dynamic partitioning experiments on million-scale textual data, evaluating how utility (measured by linguistic quality and downstream task performance) and privacy guarantees (quantified by the ε privacy budget) vary across data scales. Contribution/Results: We propose the first scale-aware joint evaluation framework for DP text rewriting and establish a more rigorous benchmarking standard. Experimental results demonstrate that larger datasets significantly improve utility: under identical privacy budgets (ε), DP text rewriting achieves superior performance at scale. This finding provides both theoretical grounding and practical guidance for deploying DP NLP mechanisms in real-world, large-scale applications.
📝 Abstract
Recent work at the intersection of Differential Privacy and Natural Language Processing (DP NLP) has proposed numerous promising text rewriting mechanisms. An often-ignored aspect in the evaluation of these mechanisms is dataset size, that is, the effect of dataset size on a mechanism's efficacy for both utility and privacy preservation. In this work, we are the first to introduce this factor into the evaluation of DP text privatization, designing utility and privacy tests on large-scale datasets with dynamic split sizes. We run these tests on datasets of varying size with up to one million texts, focusing on quantifying the effect of increasing dataset size on the privacy-utility trade-off. Our findings reveal that dataset size plays an integral part in evaluating DP text rewriting mechanisms; these findings call for more rigorous evaluation procedures in DP NLP and shed light on the future of DP NLP in practice and at scale.
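To make the evaluated scale effect concrete, the sketch below is a minimal, hypothetical analogue of the paper's setup (not the authors' actual mechanism or benchmark): a toy ε-local-DP token privatizer based on k-ary randomized response, evaluated at several "dynamic split" sizes. Utility is measured as the L1 error of the debiased unigram-frequency estimate, which shrinks as the dataset grows while the per-token privacy budget ε stays fixed. All names (`rr_privatize`, `debias_freqs`), the vocabulary size, and the split sizes are illustrative assumptions.

```python
import numpy as np

def rr_privatize(token_ids, vocab_size, epsilon, rng):
    # k-ary randomized response: keep the true token with prob. p,
    # otherwise emit a uniformly random vocabulary id. Choosing
    # p = (e^eps - 1) / (e^eps - 1 + k) makes each released token
    # satisfy epsilon-local-DP (likelihood ratio exactly e^eps).
    k = vocab_size
    p = (np.exp(epsilon) - 1.0) / (np.exp(epsilon) - 1.0 + k)
    keep = rng.random(token_ids.shape) < p
    uniform = rng.integers(0, k, size=token_ids.shape)
    return np.where(keep, token_ids, uniform), p

def debias_freqs(private_ids, vocab_size, p):
    # Invert the randomized-response channel:
    # observed = p * true + (1 - p) / k
    obs = np.bincount(private_ids, minlength=vocab_size) / len(private_ids)
    return (obs - (1.0 - p) / vocab_size) / p

rng = np.random.default_rng(0)
vocab_size, epsilon = 50, 1.0
true_freqs = rng.dirichlet(np.ones(vocab_size))  # ground-truth unigram dist.

errors = {}
for n in (1_000, 10_000, 100_000):  # dynamic split sizes (toy stand-ins)
    corpus = rng.choice(vocab_size, size=n, p=true_freqs)
    private, p = rr_privatize(corpus, vocab_size, epsilon, rng)
    est = debias_freqs(private, vocab_size, p)
    errors[n] = float(np.abs(est - true_freqs).sum())  # L1 utility loss
    print(f"n={n:>7}  L1 error={errors[n]:.4f}")
```

Under a fixed ε, the estimation error decays roughly as 1/√n, which mirrors (in a much simpler setting) the paper's observation that utility at a given privacy budget improves with dataset scale.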