🤖 AI Summary
This work addresses the challenge of scarce and costly labeled data in domains such as healthcare by proposing DecSelfMask, a relevance-guided masking strategy for self-supervised learning. The approach applies relevance attribution methods to a decoder-only model to identify task-relevant segments of unlabeled text, masks them, and trains the model to reconstruct them via next-token prediction, with the aim of transferring structural and semantic knowledge to downstream classification tasks. Evaluated on 136 clinical text classification tasks with 5 models of different scales and families, the method outperforms supervised fine-tuning by +19.9 Macro F1 points, synthetic label generation by +12.5, and continual pretraining by +6.3.
📝 Abstract
Classification tasks require annotated data, which can be expensive, time-consuming, or even infeasible to collect. This is the case in the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance the performance of decoder-only models on classification tasks. We build on common self-learning approaches, in which a model creates training examples from unlabeled data, and propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine which portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions and training the model to reconstruct them via next-token prediction. We hypothesize that these examples convey knowledge about the structure and semantics of unannotated data that is useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks across 5 models of different scales and families, and include a probing analysis. Experiments show consistent gains, with DecSelfMask outperforming standard supervised fine-tuning (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.
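The masking step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how relevance scores are computed, how masked spans are chosen, or what the prompt format looks like, so everything below is an assumption made for illustration: the per-token scores are taken as given, the masked span is the fixed-width window with the highest total score, and the `<mask>` placeholder and the helper names `top_span` and `make_example` are hypothetical rather than the authors' implementation.

```python
from typing import List, Tuple

MASK = "<mask>"  # hypothetical placeholder; the abstract does not specify one


def top_span(scores: List[float], width: int) -> Tuple[int, int]:
    """Return (start, end) of the contiguous window of `width` tokens
    with the highest total relevance. Ties go to the earliest window."""
    if not 0 < width <= len(scores):
        raise ValueError("width must be between 1 and len(scores)")
    starts = range(len(scores) - width + 1)
    best = max(starts, key=lambda s: sum(scores[s:s + width]))
    return best, best + width


def make_example(tokens: List[str], scores: List[float], width: int) -> Tuple[str, str]:
    """Build one self-supervised (prompt, target) pair.

    The prompt is the text with the most relevant span replaced by MASK;
    the target is the removed span, which a decoder-only model would be
    trained to generate via next-token prediction (assumed setup).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    start, end = top_span(scores, width)
    prompt = " ".join(tokens[:start] + [MASK] + tokens[end:])
    target = " ".join(tokens[start:end])
    return prompt, target
```

For example, calling `make_example` on the tokens `patient reports severe chest pain since monday` with made-up scores `[0.1, 0.0, 0.6, 0.9, 0.8, 0.1, 0.0]` and `width=3` yields the prompt `patient reports <mask> since monday` and the target `severe chest pain`. In a training loop, the next-token loss would typically be computed only on the target tokens, which is again an assumption about the setup rather than a detail stated in the abstract.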