🤖 AI Summary
This work addresses the inflated performance of clinical NLP models caused by temporal and lexical leakage, which poses serious risks to safe real-world deployment. To mitigate this, the authors propose a lightweight auditing framework that integrates interpretability into the model development pipeline, checking temporal validity, probability calibration, and behavioral robustness before final training. By combining temporal leakage detection with interpretability analysis, the framework curbs models' reliance on spurious cues, such as discharge-related vocabulary, that do not reflect genuine clinical signal. Experiments show that audited models produce more conservative and better-calibrated prediction probabilities, improving clinical reliability and safety at the cost of some apparent (leakage-inflated) performance.
📝 Abstract
Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.
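The two auditing steps the abstract describes, suppressing leakage-prone lexical cues before final training and checking probability calibration, can be sketched as follows. This is an illustrative sketch only: the cue list, masking strategy, and binning choices here are assumptions, not the paper's actual audited vocabulary or pipeline.

```python
import re

# Hypothetical leakage-prone, discharge-related phrases; the paper's
# actual audited vocabulary is not specified in the abstract.
LEAKAGE_CUES = {"discharge", "disposition", "home today"}

def mask_leakage_cues(note: str, cues=LEAKAGE_CUES, mask="[MASKED]") -> str:
    """Replace leakage-prone phrases in a note before (re)training."""
    # Longest cues first so multi-word phrases are masked whole.
    for cue in sorted(cues, key=len, reverse=True):
        note = re.sub(re.escape(cue), mask, note, flags=re.IGNORECASE)
    return note

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: bin-size-weighted mean |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - conf)
    return ece

note = "Case management notes discharge planning; disposition: home today."
print(mask_leakage_cues(note))
```

A lower ECE on held-out, temporally valid data after masking would indicate the audited model's probabilities are better calibrated, consistent with the more conservative behavior reported in the abstract.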