Not Another Imputation Method: A Transformer-based Model for Missing Values in Tabular Datasets

📅 2024-07-16
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Missing value imputation in tabular data often introduces bias and hinders end-to-end learning. This paper proposes NAIM, a novel Transformer architecture that models incomplete tabular data directly, without imputation. Its key contributions are: (1) feature-specific embeddings that encode categorical, numerical, and missing inputs, explicitly representing missingness; (2) a masked self-attention mechanism that fully excludes missing positions from the attention computation; and (3) a regularization strategy tailored to incomplete data to improve generalization. Evaluated on five public tabular benchmarks, NAIM consistently outperforms eleven baselines, six traditional machine learning and five deep learning models, each paired with three imputation strategies where required, improving both predictive accuracy and robustness to missingness. The implementation is publicly available.

📝 Abstract
Handling missing values in tabular datasets presents a significant challenge in training and testing artificial intelligence models, an issue usually addressed using imputation techniques. Here we introduce "Not Another Imputation Method" (NAIM), a novel transformer-based model specifically designed to address this issue without the need for traditional imputation techniques. NAIM's ability to avoid the necessity of imputing missing values and to effectively learn from available data relies on two main techniques: the use of feature-specific embeddings to encode both categorical and numerical features, while also handling missing inputs; and the modification of the masked self-attention mechanism to completely mask out the contributions of missing data. Additionally, a novel regularization technique is introduced to enhance the model's generalization capability from incomplete data. We extensively evaluated NAIM on 5 publicly available tabular datasets, demonstrating its superior performance over 6 state-of-the-art machine learning models and 5 deep learning models, each paired with 3 different imputation techniques when necessary. The results highlight the efficacy of NAIM in improving predictive performance and resilience in the presence of missing data. To facilitate further research and practical application in handling missing data without traditional imputation methods, we made the code for NAIM available at https://github.com/cosbidev/NAIM.
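The masked self-attention described in the abstract can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's actual implementation: it assumes identity query/key/value projections and a boolean mask marking which feature tokens are missing.

```python
import numpy as np

def masked_self_attention(x, missing_mask):
    """Toy single-head self-attention over feature tokens.

    Score-matrix columns belonging to missing tokens are set to -inf
    before the softmax, so missing features receive zero attention
    weight, mirroring the idea of completely masking out the
    contributions of missing data.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)        # identity Q/K projections (illustration only)
    scores[:, missing_mask] = -np.inf    # no attention *to* missing tokens
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                   # identity V projection
```

Because the masked columns receive exactly zero weight, whatever placeholder value sits at a missing position can never influence the outputs of the observed tokens, which is what removes the need for imputation.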
Problem

Research questions and friction points this paper is trying to address.

Handling missing values in tabular datasets without imputation
Developing a transformer-based model for incomplete data learning
Improving predictive performance with missing data resilience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based model avoids traditional imputation techniques
Feature-specific embeddings handle missing categorical and numerical data
Modified masked self-attention ignores missing data contributions
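As a concrete, hypothetical illustration of the feature-specific embeddings above: each numerical feature could own a learned direction vector that observed values scale, plus a learned "missing" vector substituted when the value is absent (categorical features would analogously use per-category lookup tables). The class name, shapes, and initialization here are assumptions for the sketch, not the paper's code.

```python
import numpy as np

class FeatureEmbedder:
    """Per-feature embeddings with a dedicated learned 'missing' vector.

    Observed numerical values scale the feature's direction vector;
    NaN inputs are replaced by that feature's missing embedding, so
    missingness is encoded explicitly rather than imputed.
    """
    def __init__(self, n_features, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(n_features, dim))        # per-feature direction
        self.missing = rng.normal(size=(n_features, dim))  # per-feature missing token

    def __call__(self, row):
        out = np.empty((len(row), self.w.shape[1]))
        for j, v in enumerate(row):
            out[j] = self.missing[j] if np.isnan(v) else v * self.w[j]
        return out
```

In a trained model both `w` and `missing` would be learned parameters, letting the downstream attention layers treat "this feature is absent" as an informative token in its own right.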
Camillo Maria Caruso
PhD student, Università Campus Bio-Medico di Roma
artificial intelligence, deep learning, computer vision
P. Soda
Research Unit of Computer Systems and Bioinformatics, Department of Engineering, Università Campus Bio-Medico di Roma, Roma, Italy; also with the Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University, Umeå, Sweden
Valerio Guarrasi
Università Campus Bio-Medico di Roma, Italy
Artificial Intelligence, Machine Learning, Multimodal Deep Learning, Generative AI