🤖 AI Summary
This study investigates how linguistic structure influences privacy leakage risks in multilingual large language models (LLMs), focusing on models trained on medical corpora in English, Spanish, French, and Italian. We employ three privacy attack paradigms—extraction attacks, counterfactual memorization, and membership inference—alongside six quantitative linguistic metrics (e.g., redundancy, tokenization granularity, morphological complexity) to systematically characterize the relationship between language properties and privacy vulnerability. Our analysis reveals that higher linguistic redundancy and coarser tokenization granularity correlate positively with increased privacy leakage: English models exhibit the strongest membership distinguishability; morphologically richer French and Spanish models demonstrate greater privacy robustness; and Italian models suffer the most severe leakage. This work establishes the first linguistically grounded, interpretable framework for assessing and comparing privacy risks across multilingual LLMs.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in multilingual applications that handle sensitive data, yet their scale and linguistic variability introduce major privacy risks. Although such risks have mostly been evaluated for English, this paper investigates how language structure affects privacy leakage in LLMs trained on English, Spanish, French, and Italian medical corpora. We quantify six linguistic indicators and evaluate three attack vectors: extraction, counterfactual memorization, and membership inference. Results show that privacy vulnerability scales with linguistic redundancy and tokenization granularity: Italian exhibits the strongest leakage, while English shows the highest membership separability. In contrast, French and Spanish display greater resilience due to higher morphological complexity. Overall, our findings provide the first quantitative evidence that language matters in privacy leakage, underscoring the need for language-aware privacy-preserving mechanisms in LLM deployments.
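To illustrate the membership-inference paradigm mentioned above, here is a minimal loss-thresholding sketch: a record whose average per-token loss falls below a threshold is flagged as a likely training member. The per-token losses and the threshold here are hypothetical placeholders, not the paper's actual attack or data.

```python
import statistics

def membership_score(token_losses):
    """Average negative log-likelihood of a record; lower loss suggests memorization."""
    return statistics.mean(token_losses)

def infer_membership(token_losses, threshold):
    """Flag the record as a training member if its average loss is below the threshold."""
    return membership_score(token_losses) < threshold

# Hypothetical per-token losses (illustrative values, not real model outputs):
member_losses = [0.9, 1.1, 0.8, 1.0]      # memorized record: low loss
nonmember_losses = [2.4, 2.9, 3.1, 2.6]   # unseen record: high loss

threshold = 2.0  # hypothetical decision threshold
print(infer_membership(member_losses, threshold))     # True
print(infer_membership(nonmember_losses, threshold))  # False
```

In practice, attacks of this family calibrate the threshold (or a reference model) per language, which is why differences in redundancy and tokenization granularity can shift membership separability across languages.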