In-Context Bias Propagation in LLM-Based Tabular Data Generation

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical vulnerability in large language models (LLMs): when generating tabular data via in-context learning (ICL), even minor statistical biases present in input examples are systematically amplified, distorting the synthetic data distribution—particularly exacerbating demographic skew in real-world settings. To investigate this phenomenon, the authors develop a rigorous methodology comprising bias quantification, adversarial context construction, and a fairness evaluation framework. Their analysis is the first to systematically demonstrate that low-magnitude biases in ICL demonstrations induce global statistical distortions and can be deliberately exploited by malicious contributors to significantly degrade downstream classifier fairness—especially for protected groups. The study introduces a reproducible metric for measuring bias propagation in ICL-based data generation, establishing both a key risk alert and a foundational safety benchmark for LLM-driven synthetic data applications.

📝 Abstract
Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown the potential of LLMs to improve downstream task performance through augmenting underrepresented groups, these benefits often assume access to a subset of unbiased in-context examples, representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario where a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted and protected subgroup. Our findings demonstrate a new vulnerability associated with LLM-based data generation pipelines that rely on in-context prompts in sensitive domains.
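The bias-propagation measurement the abstract describes can be illustrated with a minimal sketch (not the paper's actual code): compare the empirical distribution of a protected attribute in the in-context examples against the same distribution in the generated rows. The attribute name "sex", the toy record lists, and the use of total variation distance as the propagation metric are all assumptions for illustration.

```python
# Illustrative sketch: quantify how a statistical bias in in-context
# examples shifts further in the synthetic output.
from collections import Counter

def attribute_distribution(rows, attr):
    """Empirical distribution of one categorical attribute."""
    counts = Counter(r[attr] for r in rows)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# Mildly skewed in-context examples (60/40 instead of 50/50)...
in_context = [{"sex": "M"}] * 6 + [{"sex": "F"}] * 4
# ...and hypothetical synthetic output whose skew was amplified (80/20).
synthetic = [{"sex": "M"}] * 8 + [{"sex": "F"}] * 2

shift = total_variation(
    attribute_distribution(in_context, "sex"),
    attribute_distribution(synthetic, "sex"),
)
print(round(shift, 2))  # 0.2 -- the skew grew rather than washing out
```

A real evaluation would compute this over every generated batch and every sensitive attribute, tracking whether the distance to the unbiased reference distribution grows with the bias magnitude of the prompt.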
Problem

Research questions and friction points this paper is trying to address.

Study bias propagation in LLM-generated tabular data
Examine adversarial bias injection via in-context examples
Assess fairness risks in downstream classifier performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs for synthetic tabular data generation
Study bias propagation in in-context examples
Adversarial bias injection via in-context prompts
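The downstream fairness risk named above is typically checked with a group-fairness metric. A hedged sketch, using demographic parity difference (the largest gap in positive-prediction rate between protected groups); the prediction and group lists are illustrative, not taken from the paper:

```python
# Illustrative sketch: demographic parity difference of a downstream
# classifier's predictions across a protected attribute.
def positive_rate(preds, groups, group_value):
    """Fraction of positive predictions within one group."""
    sel = [p for p, g in zip(preds, groups) if g == group_value]
    return sum(sel) / len(sel)

def demographic_parity_diff(preds, groups):
    """Max gap in positive rates across groups (0 = parity)."""
    rates = {g: positive_rate(preds, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values())

# A classifier trained on poisoned synthetic data favours group "A":
preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_diff(preds, groups))  # 0.5
```

Under the paper's adversarial scenario, the attacker's goal would be to make this gap widen for a targeted subgroup while the synthetic data still looks plausible overall.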
Pol G. Recasens
Barcelona Supercomputing Center, Barcelona, Catalonia, Spain
Alberto Gutierrez
Barcelona Supercomputing Center, Barcelona, Catalonia, Spain
Jordi Torres
UPC Barcelona Tech - Barcelona Supercomputing Center
Supercomputing for Artificial Intelligence, Artificial Intelligence, Deep Learning, Reinforcement Learning, Supercomputing
Josep Ll. Berral
Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
Anisa Halimi
IBM Research
privacy and security, social networks, big data
Kieran Fraser
IBM Research Ireland