🤖 AI Summary
To address the challenge of limited annotated medical text data, particularly in oncology, which constrains classification performance, this paper introduces a high-quality four-class dataset of 1,874 cancer-related abstracts (thyroid cancer, colon cancer, lung cancer, and generic biomedical topics). We propose the Residual Graph Attention Network (R-GAT), which integrates graph-structured representation learning with residual learning to capture both semantic associations among medical terms and local-global contextual dependencies. The model fuses heterogeneous embeddings (TF-IDF, Word2Vec, and BERT) and is evaluated against CNN, LSTM, BERT-based models (including BioBERT and ClinicalBERT), and ensemble baselines. On the four-class classification task, R-GAT achieves F1-scores of 0.98 (thyroid), 0.95 (colon), 0.97 (lung), and 0.95 (generic), outperforming state-of-the-art biomedical language models. These results support a sample-efficient approach to medical text classification with strong generalizability.
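The core building block named above, a graph attention layer with a residual (skip) connection, can be sketched in NumPy. This is a minimal illustration of the general GAT mechanism (attention logits from a shared linear transform, softmax normalization over neighbors, residual addition), not the paper's exact architecture; the function names and single-head formulation are assumptions for clarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def residual_gat_layer(H, A, W, a, slope=0.2):
    """One single-head graph attention layer with a residual connection.

    H: (N, F) node features; A: (N, N) adjacency (1 = edge, self-loops included);
    W: (F, F) shared linear transform (square so the skip connection type-checks);
    a: (2F,) attention vector; slope: LeakyReLU negative slope.
    """
    Z = H @ W                                    # transformed node features
    out = np.zeros_like(Z)
    for i in range(H.shape[0]):
        nbrs = np.where(A[i] > 0)[0]
        # attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
        logits = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        logits = np.where(logits > 0, logits, slope * logits)
        att = softmax(logits)                    # normalize over the neighborhood
        out[i] = (att[:, None] * Z[nbrs]).sum(axis=0)
    return out + H                               # residual connection
```

Stacking several such layers lets later layers refine earlier representations while the skip connections keep gradients stable, which is the usual motivation for combining residual learning with graph attention.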
📝 Abstract
Accurate classification of cancer-related medical abstracts is crucial for healthcare management and research. However, obtaining large, labeled datasets in the medical domain is challenging due to privacy concerns and the complexity of clinical data. This scarcity of annotated data impedes the development of effective machine learning models for cancer document classification. To address this challenge, we present a curated dataset of 1,874 biomedical abstracts, categorized into thyroid cancer, colon cancer, lung cancer, and generic topics. Our research focuses on leveraging this dataset to improve classification performance, particularly in data-scarce scenarios. We introduce a Residual Graph Attention Network (R-GAT) with multiple graph attention layers that capture the semantic information and structural relationships within cancer-related documents. Our R-GAT model is compared with various techniques, including transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa, and domain-specific models such as BioBERT and Bio+ClinicalBERT. We also evaluate deep learning models (CNNs, LSTMs) and traditional machine learning models (Logistic Regression, SVM). Additionally, we explore ensemble approaches that combine deep learning models to enhance classification. Various feature extraction methods are assessed, including Term Frequency-Inverse Document Frequency (TF-IDF) with unigrams and bigrams, Word2Vec, and tokenizers from BERT and RoBERTa. The R-GAT model outperforms the other techniques, achieving precision, recall, and F1 scores of 0.99, 0.97, and 0.98 for thyroid cancer; 0.96, 0.94, and 0.95 for colon cancer; 0.96, 0.99, and 0.97 for lung cancer; and 0.95, 0.96, and 0.95 for generic topics.
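One of the feature extraction methods listed above, TF-IDF over unigrams and bigrams, can be sketched in plain Python. This is a generic illustration of the weighting scheme, not the paper's exact pipeline; the smoothed-IDF convention used here is an assumption (it matches a common default, e.g. scikit-learn's), and the helper names are invented for clarity.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Unigrams plus bigrams, as in the TF-IDF setup described above."""
    grams = list(tokens)
    grams += [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    return grams

def tfidf(corpus):
    """corpus: list of token lists -> list of {term: tf-idf weight} dicts."""
    docs = [Counter(ngrams(toks)) for toks in corpus]
    N = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())                       # document frequency per term
    # smoothed idf (an assumed convention, not the paper's stated formula)
    idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in df}
    return [{t: (c / sum(d.values())) * idf[t] for t, c in d.items()}
            for d in docs]
```

Terms that occur across the whole corpus (e.g. "cancer" in this dataset) receive low IDF and thus low weight, while class-discriminative unigrams and bigrams ("thyroid", "lung cancer") are weighted up, which is what makes TF-IDF a useful baseline representation here.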