SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

📅 2026-06-06

🤖 AI Summary

This study addresses the underexplored threat of reconstruction attacks (equivalently, attribute inference) on de-identified and synthetic tabular data released as a privacy-preserving substitute for sensitive records. It presents the first systematization of these attacks: a taxonomy organized by the structure each attack exploits and a comparable evaluation of 14 attacks against 9 synthetic data generation (SDG) methods on 5 benchmark datasets, plus a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack measured. A memorization test distinguishes reconstruction of the population distribution from memorization of training records, and a reduction places reconstruction and membership inference on a single comparable scale. The analysis finds that the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets (ε ≲ 1), above which protection plateaus; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, which concentrates individual risk on atypical records. The authors placed first among all red teams in the 2025 NIST Collaborative Research Cycle (CRC).

📝 Abstract

Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon\lesssim1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle.
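
To make the threat model in the abstract concrete, the following is a minimal illustrative sketch, not the paper's CoBP-RA or any of its fourteen attacks. It assumes a naive baseline that guesses a hidden attribute by majority vote over synthetic rows that match the known quasi-identifiers, and it loosely mirrors the idea of a memorization test by comparing attack accuracy on training records with accuracy on holdout records from the same population. The function names, column names, and toy data are all hypothetical.

```python
from collections import Counter


def infer_attribute(synthetic, known, target):
    """Guess `target` for one individual from synthetic rows that match
    the known quasi-identifier values (hypothetical baseline attack)."""
    matches = [row[target] for row in synthetic
               if all(row[k] == v for k, v in known.items())]
    # Fall back to the overall synthetic distribution when no row matches.
    pool = matches or [row[target] for row in synthetic]
    return Counter(pool).most_common(1)[0][0]


def attack_accuracy(synthetic, records, quasi_ids, target):
    """Fraction of records whose hidden attribute is guessed correctly."""
    hits = 0
    for rec in records:
        known = {k: rec[k] for k in quasi_ids}
        hits += infer_attribute(synthetic, known, target) == rec[target]
    return hits / len(records)


# Toy data (assumed for illustration only).
synthetic = [
    {"age": "30-39", "zip": "A", "diagnosis": "flu"},
    {"age": "30-39", "zip": "A", "diagnosis": "flu"},
    {"age": "30-39", "zip": "A", "diagnosis": "asthma"},
    {"age": "60-69", "zip": "B", "diagnosis": "diabetes"},
]
train = [  # records assumed to have been used to fit the synthesizer
    {"age": "30-39", "zip": "A", "diagnosis": "flu"},
    {"age": "60-69", "zip": "B", "diagnosis": "diabetes"},
]
holdout = [  # records from the same population, never seen by the synthesizer
    {"age": "30-39", "zip": "A", "diagnosis": "flu"},
    {"age": "60-69", "zip": "B", "diagnosis": "asthma"},
]

quasi_ids, target = ["age", "zip"], "diagnosis"
train_acc = attack_accuracy(synthetic, train, quasi_ids, target)
holdout_acc = attack_accuracy(synthetic, holdout, quasi_ids, target)
print(f"train accuracy:   {train_acc:.2f}")
print(f"holdout accuracy: {holdout_acc:.2f}")
```

Under these assumptions, similar accuracy on the two sets would suggest that the attack is exploiting population-level structure, while a large gap in favor of training records would point toward memorization. On real data the comparison would need far larger samples and a proper baseline.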

Problem

Research questions and friction points this paper is trying to address.

reconstruction attacks

synthetic tabular data

privacy

attribute inference

de-identification

Innovation

Methods, ideas, or system contributions that make the work stand out.

reconstruction attacks

synthetic tabular data

attribute inference