🤖 AI Summary
This study investigates whether human-provided clinical rationales can improve both the performance and the interpretability of Transformer models for clinical text classification. Using 99,125 manually annotated clinical rationales as additional training samples alongside 128,649 electronic pathology reports, we train models to extract primary cancer sites and quantify interpretability via average token-level rationale coverage. Our systematic experiments reveal: (1) Rationales used as additional training data can improve accuracy in high-resource settings but behave inconsistently when resources are limited, and they are consistently outperformed by simply adding more pathology reports; (2) Their contribution to model interpretability is only slight; (3) Pre-selecting rationales with sufficiency as an automatic quality metric also yields inconsistent results. Collectively, these findings expose the practical limitations of clinical rationales as supervisory signals in supervised learning and offer empirical guidance for explainable AI in healthcare: when accuracy is the goal, annotation effort is better spent on labeling more reports than on creating rationales.
📝 Abstract
AI-driven clinical text classification is vital for explainable, automated retrieval of population-level health information. This work investigates whether human-provided clinical rationales can serve as additional supervision to improve both the performance and the explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a measure of rationale quality for pre-selecting rationales. Our results show that clinical rationales used as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to pre-select rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. Therefore, if the goal is to optimize accuracy, annotation effort should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data yields smaller performance improvements than training on additional reports, with only slightly better explainability (measured as average token-level rationale coverage).
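The abstract names average token-level rationale coverage as its explainability measure but does not define it here. The sketch below shows one plausible reading, offered as an assumption rather than the paper's actual formula: for each document, take the fraction of human rationale token positions that also appear among the model's most highly attributed token positions, then average over documents. The function names and the toy data are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def rationale_coverage(top_tokens: Sequence[int], rationale_tokens: Set[int]) -> float:
    """Fraction of rationale token positions found among the model's top-attributed positions.

    ASSUMPTION: this definition is illustrative; the paper may compute coverage differently.
    """
    if not rationale_tokens:
        # No annotated rationale for this document: treat coverage as 0.
        return 0.0
    hits = rationale_tokens.intersection(top_tokens)
    return len(hits) / len(rationale_tokens)


def average_coverage(docs: Iterable[Tuple[Sequence[int], Set[int]]]) -> float:
    """Mean per-document coverage over (top_tokens, rationale_tokens) pairs."""
    scores = [rationale_coverage(top, rationale) for top, rationale in docs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with token positions (not real pathology data).
docs = [
    ([0, 3, 4], {3, 4, 5, 6}),  # 2 of 4 rationale tokens covered -> 0.5
    ([1, 2], {1, 2}),           # 2 of 2 rationale tokens covered -> 1.0
]
print(average_coverage(docs))  # 0.75
```

Under this reading, a model whose attributions concentrate on the same tokens a human annotator marked scores close to 1, which matches the abstract's notion of identifying "rationale-like features".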