When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work uncovers a privacy risk in which large language models (LLMs) leak memorized numeric string patterns from their training data when generating synthetic tabular data. To exploit this vulnerability, the authors propose LevAtt, a no-box membership inference attack tailored to numeric strings that requires adversarial access only to the generated synthetic samples. They further introduce a numeric perturbation-based sampling strategy that mitigates this leakage without compromising data fidelity or downstream machine learning performance. Experiments expose substantial privacy leakage across multiple LLMs and datasets, with LevAtt acting as a perfect (100%-accuracy) membership classifier in some cases; the perturbation defense reduces the attack success rate to near-random levels (≈50%) while preserving the statistical utility and predictive performance of the synthetic data on standard ML tasks.
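The paper does not spell out the attack mechanics here, but the name LevAtt suggests a Levenshtein (edit) distance score over digit strings. The sketch below is a hypothetical illustration of such a no-box attack, assuming membership is inferred from the minimum edit distance between a candidate record's concatenated digits and any generated synthetic record; the function names, digit-string encoding, and threshold are all illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch of a LevAtt-style no-box membership inference attack.
# Assumption: membership is scored by the minimum Levenshtein distance between
# a candidate record's digit string and any synthetic record's digit string.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def digit_string(row) -> str:
    """Concatenate the digits of a record's numeric fields (illustrative encoding)."""
    return "".join(ch for field in row for ch in str(field) if ch.isdigit())

def membership_score(candidate, synthetic_rows) -> int:
    """Lower score = closer match to some synthetic row = more likely a training member."""
    target = digit_string(candidate)
    return min(levenshtein(target, digit_string(s)) for s in synthetic_rows)

def infer_member(candidate, synthetic_rows, threshold=2) -> bool:
    # threshold is a free parameter an attacker would calibrate on public data
    return membership_score(candidate, synthetic_rows) <= threshold
```

Note that this requires only the synthetic samples themselves, matching the paper's "no-box" threat model in which the adversary has no access to the model or its outputs' probabilities.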

📝 Abstract
Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.
Problem

Research questions and friction points this paper addresses.

LLM-based tabular data generators can compromise privacy by reproducing memorized numeric strings
How much membership information can an adversary extract from the synthetic data alone?
Can such leakage be mitigated without degrading the fidelity and utility of the synthetic data?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing LevAtt, a no-box membership inference attack
Targeting string sequences of numeric digits in synthetic data
Proposing a novel sampling strategy to perturb digits
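The defense listed above perturbs digits during generation. The paper's actual sampling strategy is described as strategic, so the sketch below is only a minimal illustration of the underlying idea: randomly resampling a fraction of generated digits so that exact memorized digit strings no longer survive into the output. The function name and the perturbation probability `p` are assumptions for illustration.

```python
import random

# Hypothetical sketch of a digit-perturbation defense: each digit in a
# generated value is independently replaced by a uniformly random digit
# with probability p, breaking exact digit-string memorization while
# leaving non-digit characters (decimal points, signs) untouched.

def perturb_digits(value: str, p: float = 0.3, rng=None) -> str:
    rng = rng or random.Random()
    out = []
    for ch in value:
        if ch.isdigit() and rng.random() < p:
            out.append(rng.choice("0123456789"))  # resample this digit
        else:
            out.append(ch)  # keep non-digits and unperturbed digits as-is
    return "".join(out)
```

In practice such a defense would need to balance `p` against fidelity: too little perturbation leaves memorized strings intact, too much distorts the marginal distributions the synthetic data is meant to preserve.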
Joshua Ward
University of California Los Angeles
Bochao Gu
University of California Los Angeles
Chi-Hua Wang
Department of Supply Chain and Operations Management, Purdue University
Dynamic Pricing, Bandit Algorithms, Synthetic Data Generation, Differential Privacy
Guang Cheng
University of California Los Angeles