Large Language Models for Missing Data Imputation: Understanding Behavior, Hallucination Effects, and Control Mechanisms

📅 2026-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses critical gaps in prior work on missing data imputation: limited scalability, lack of cross-model comparability, and insufficient systematic evaluation across the main missingness mechanisms (MCAR, MAR, MNAR). It conducts the first large-scale comparison of five large language models (e.g., Gemini 3.0 Flash, Claude 4.5 Sonnet) against six traditional approaches (e.g., MICE), evaluating zero-shot prompting performance across 29 real-world and synthetic datasets under varying missing rates and mechanisms. The findings reveal that large language models rely on semantic priors rather than statistical reconstruction: they significantly outperform conventional methods on real-world data but underperform on synthetic data. Despite achieving higher imputation quality, these models also incur substantially greater computational and monetary costs.
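To make the zero-shot setup concrete, the sketch below serializes one tabular record with a missing cell into a prompt for an LLM. This is a hedged illustration only: the prompt template, the function name `build_imputation_prompt`, and the medical-style column names are assumptions for this example, not the paper's actual protocol.

```python
def build_imputation_prompt(columns, row, missing_col):
    """Serialize one tabular record with a missing cell into a zero-shot
    prompt for an LLM. The template is a hypothetical illustration; the
    paper's exact prompt is not reproduced here."""
    # Keep only the observed attributes; the missing column is excluded.
    observed = ", ".join(
        f"{col} = {val}" for col, val in zip(columns, row) if col != missing_col
    )
    return (
        "You are a data imputation assistant. Given the observed attributes "
        f"of a record, predict the missing value of '{missing_col}'.\n"
        f"Observed attributes: {observed}\n"
        f"Respond with only the value of '{missing_col}'."
    )

# Example record from a hypothetical dataset with 'BloodPressure' missing.
prompt = build_imputation_prompt(
    columns=["Age", "Sex", "BMI", "BloodPressure"],
    row=[54, "F", 27.1, None],
    missing_col="BloodPressure",
)
```

The returned string would then be sent to each model and the reply parsed back into the table; cost and latency accrue per masked cell, which is one source of the economic trade-off the study reports.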

📝 Abstract
Data imputation is a cornerstone technique for handling the missing values that plague real-world datasets. Despite recent progress, prior studies on Large Language Model-based imputation remain limited by scalability challenges, restricted cross-model comparisons, and evaluations conducted on small or domain-specific datasets. Furthermore, heterogeneous experimental protocols and inconsistent treatment of missingness mechanisms (MCAR, MAR, and MNAR) hinder systematic benchmarking across methods. This work investigates the robustness of Large Language Models for missing data imputation in tabular datasets using a zero-shot prompt engineering approach. To this end, we present a comprehensive benchmarking study comparing five widely used LLMs against six state-of-the-art imputation baselines. The experimental design evaluates these methods across 29 datasets (including nine synthetic datasets) under MCAR, MAR, and MNAR mechanisms, with missing rates of up to 20%. The results demonstrate that leading LLMs, particularly Gemini 3.0 Flash and Claude 4.5 Sonnet, consistently achieve superior performance on real-world open-source datasets compared to traditional methods. However, this advantage appears to be closely tied to the models' prior exposure to domain-specific patterns learned during pre-training on internet-scale corpora. In contrast, on synthetic datasets, traditional methods such as MICE outperform LLMs, suggesting that LLM effectiveness is driven by semantic context rather than purely statistical reconstruction. Furthermore, we identify a clear trade-off: while LLMs excel in imputation quality, they incur significantly higher computational time and monetary costs. Overall, this study provides a large-scale comparative analysis, positioning LLMs as promising semantics-driven imputers for complex tabular data.
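The missingness mechanisms named in the abstract can be illustrated with a short NumPy sketch that injects MCAR and MAR missingness into a numeric matrix. The function names and the median-threshold MAR rule are assumptions chosen for clarity, not the generators used in the paper's benchmark.

```python
import numpy as np

def inject_mcar(X, missing_rate, seed=0):
    """MCAR: every cell is masked independently with the same probability,
    regardless of any observed or unobserved value."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    X[rng.random(X.shape) < missing_rate] = np.nan
    return X

def inject_mar(X, missing_rate, cause_col, target_col, seed=0):
    """MAR: missingness in `target_col` depends only on the *observed*
    `cause_col`; here rows above the median of `cause_col` are the
    candidates for masking (an illustrative rule)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    candidates = np.flatnonzero(X[:, cause_col] > np.median(X[:, cause_col]))
    n_missing = min(int(round(missing_rate * X.shape[0])), candidates.size)
    X[rng.choice(candidates, size=n_missing, replace=False), target_col] = np.nan
    return X

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
X_mcar = inject_mcar(X, missing_rate=0.20)
X_mar = inject_mar(X, missing_rate=0.20, cause_col=0, target_col=1)
```

MNAR, the third mechanism, would make the masking depend on the missing value itself (e.g., masking the largest values of the target column); it is the hardest case for any imputer because the missingness carries information that is never observed.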
Problem

Research questions and friction points this paper is trying to address.

missing data imputation
Large Language Models
tabular data
missingness mechanisms
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Missing Data Imputation
Zero-shot Prompting
Tabular Data
Benchmarking
Arthur Dantas Mangussi
Computer Science Division, Aeronautics Institute of Technology, Praça Marechal Eduardo Gomes, 50, São José dos Campos, 12228-900, São Paulo, Brazil; Science and Technology Institute, Federal University of São Paulo, Avenue Cesare Monsueto Giulio Lattes, 1201, São José dos Campos, 12231-280, São Paulo, Brazil
Ricardo Cardoso Pereira
University of Coimbra, CISUC/LASI – Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Polo II, Pinhal de Marrocos, Coimbra 3030-290, Portugal
Ana Carolina Lorena
Instituto Tecnológico de Aeronáutica
Machine Learning, Data Mining, Data Science
Pedro Henriques Abreu
Associate Professor with Habilitation, Department of Informatics Engineering, University of Coimbra
Data-Centric AI, Computational Intelligence, Missing Data, Imbalanced Data