🤖 AI Summary
This study addresses data quality challenges in the German Transplantation Registry (TxReg), where missing values, inconsistencies between data sources, and ambiguity in the choice of event-time variables can compromise research reliability. It covers first-time kidney-only transplantations in adult recipients from deceased donors between 2006 and 2016, with data from 14,954 recipients and 9,964 donors across 25 tables. The work characterizes conflicts and complementarities among variables reported by several data providers and identifies 168 such multi-sourced variables. Combining a decision-tree analysis of missing data patterns, an influx and outflux analysis, and multi-source consistency checks, the study describes the missing data structure and gives recommendations for data preprocessing and analysis. Findings show that missing data proportions vary widely: some tables are nearly complete, others exceed 50% missing values, and some variables have high potential for imputing missing data while others are less suitable. Event-time analyses depend strongly on which variables are selected, so this selection needs care. The work provides a basis for future research that uses TxReg data.
📝 Abstract
This study presents an Initial Data Analysis (IDA) of the German Transplantation Registry (TxReg) data, carried out to improve understanding of the data and to inform future analyses. The IDA focuses on first-time kidney-only transplantations in adult recipients from deceased donors between 2006 and 2016 and covers data from 14,954 recipients and 9,964 donors across 25 tables. Investigated aspects include missing data patterns and structure, data consistency, and the availability of event time data. Results show that missing data proportions vary widely: some tables are nearly complete, while others have over 50% missing values. Missing data patterns are identified using a decision tree approach. An influx and outflux analysis demonstrates that some variables have high potential for imputing missing data, while others are less suitable for imputation. We identified 168 multi-sourced variables that are reported by multiple data providers in parallel, which leads to discrepancies for some variables but also provides opportunities for missing data imputation. Our findings on event time data demonstrate the importance of carefully selecting the variables used for event time analyses, as results depend strongly on this selection. In summary, our findings highlight the challenges of using the TxReg data for research and provide recommendations for data preprocessing and analysis in future studies.
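The influx and outflux analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions of the two statistics (as used, for example, by the `flux` function of the R package mice); the abstract does not say which implementation the authors used. The column names and toy rows are hypothetical and are not TxReg data. Influx of a variable counts the pairs in which that variable is missing and another variable is observed, relative to all observed cells; outflux counts the pairs in which the variable is observed and another is missing, relative to all missing cells.

```python
def flux(names, rows):
    """Return {name: (influx, outflux)} for a table given as a list of rows.

    None marks a missing value. Assumption: a zero denominator (no observed
    or no missing cells at all) yields 0.0 rather than NaN.
    """
    n_cols = len(names)
    # r[i][j] is 1 when the value in row i, column j is observed, else 0
    r = [[0 if value is None else 1 for value in row] for row in rows]
    total_observed = sum(sum(row) for row in r)
    total_missing = len(rows) * n_cols - total_observed

    result = {}
    for j, name in enumerate(names):
        # Pairs where column j is missing and another column is observed
        in_pairs = sum((1 - row[j]) * sum(row) for row in r)
        # Pairs where column j is observed and another column is missing
        out_pairs = sum(row[j] * (n_cols - sum(row)) for row in r)
        influx = in_pairs / total_observed if total_observed else 0.0
        outflux = out_pairs / total_missing if total_missing else 0.0
        result[name] = (influx, outflux)
    return result


# Hypothetical toy table, not registry data
names = ["recipient_age", "donor_creatinine", "cold_ischemia_time"]
rows = [
    (54, 1.1, 12.0),
    (61, None, 9.5),
    (47, None, None),
    (39, 0.9, None),
]

for name, (influx, outflux) in flux(names, rows).items():
    print(f"{name}: influx={influx:.3f}, outflux={outflux:.3f}")
```

In this toy table the fully observed column has influx 0 and outflux 1, which marks it as a useful predictor for imputing the other columns. Columns with higher influx depend more on the rest of the data for their own imputation.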