🤖 AI Summary
Imputing missing data in biostatistical analysis often involves a trade-off between accuracy and computational efficiency. This paper proposes MissARF, a generative imputation method based on adversarial random forests (ARF) that supports both single and multiple imputation. Its core idea is to use ARF to model high-dimensional conditional distributions efficiently and to generate imputed values directly by conditional sampling. For multiple imputation, the same fitted ARF model is reused, so additional imputations incur no extra model-fitting cost. Experiments on diverse real-world and synthetic datasets show that MissARF achieves imputation accuracy comparable to state-of-the-art methods, including MICE and GAIN, while reducing runtime by one to two orders of magnitude. The efficiency gain is especially pronounced for multiple imputation. MissARF is also easy to use and scales well, making it suitable for large-scale biostatistical applications.
📝 Abstract
Handling missing values is a common challenge in biostatistical analyses and is typically addressed with imputation methods. We propose a novel, fast, and easy-to-use imputation method called missing value imputation with adversarial random forests (MissARF), based on generative machine learning, that provides both single and multiple imputation. MissARF employs adversarial random forests (ARF) for density estimation and data synthesis. To impute a missing value of an observation, we condition on the non-missing values and sample from the conditional distribution estimated by ARF. Our experiments demonstrate that MissARF performs comparably to state-of-the-art single and multiple imputation methods in terms of imputation quality, with fast runtime and no additional cost for multiple imputation.
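The conditional-sampling step described above can be illustrated with a minimal sketch. It is not MissARF and does not use ARF: as a loudly stated assumption, a multivariate Gaussian fitted to the complete rows stands in for the density estimator, because its conditional distribution has a closed form. The function names `fit_gaussian` and `impute_by_conditional_sampling` are hypothetical and chosen only for this illustration. The point it shows is that the density model is fitted once, and each of the `m` imputations is just another draw from the same conditional distribution.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate mean and covariance from the rows of X without missing values.

    ASSUMPTION: a Gaussian stands in for the ARF density estimate.
    """
    complete = X[~np.isnan(X).any(axis=1)]
    return complete.mean(axis=0), np.cov(complete, rowvar=False)


def impute_by_conditional_sampling(X, mean, cov, m=1, seed=0):
    """Return m completed copies of X, sampling missing values given observed ones."""
    rng = np.random.default_rng(seed)
    completed = [X.copy() for _ in range(m)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue  # nothing to impute in this row
        obs = ~miss
        if obs.any():
            # Gaussian conditioning: distribution of missing | observed.
            s_oo = cov[np.ix_(obs, obs)]
            s_mo = cov[np.ix_(miss, obs)]
            gain = np.linalg.solve(s_oo, s_mo.T).T
            cond_mean = mean[miss] + gain @ (row[obs] - mean[obs])
            cond_cov = cov[np.ix_(miss, miss)] - gain @ s_mo.T
            cond_cov = (cond_cov + cond_cov.T) / 2  # enforce symmetry
        else:
            # Row is entirely missing: fall back to the marginal distribution.
            cond_mean, cond_cov = mean, cov
        # One draw per imputation; the fitted model is reused for all m draws.
        draws = rng.multivariate_normal(cond_mean, cond_cov, size=m)
        for k in range(m):
            completed[k][i, miss] = draws[k]
    return completed


# Toy data: three correlated variables with roughly 20% of entries removed.
rng = np.random.default_rng(1)
true_cov = [[1.0, 0.8, 0.5], [0.8, 1.0, 0.3], [0.5, 0.3, 1.0]]
X_full = rng.multivariate_normal([0.0, 0.0, 0.0], true_cov, size=200)
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.2] = np.nan

mean, cov = fit_gaussian(X_missing)  # fitted once
single = impute_by_conditional_sampling(X_missing, mean, cov, m=1)[0]
multiple = impute_by_conditional_sampling(X_missing, mean, cov, m=5)
```

Setting `m=1` gives a single imputation. A larger `m` gives multiple completed datasets, whose downstream estimates would then be pooled in the usual way for multiple imputation. In this sketch the only cost of extra imputations is extra sampling, which mirrors the runtime argument made in the abstract.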