🤖 AI Summary
This study addresses fine-grained electrocardiogram (ECG) classification in non-English clinical settings, specifically Japanese, where diagnostic labels are highly heterogeneous and linguistically distinct. We propose a contrastive learning-based multimodal framework that retains nearly 100 real-world Japanese diagnostic labels. The method jointly optimizes an ECG time-series encoder and a Japanese medical text encoder so that the two modalities are aligned in a shared representation space. We present this as the first validation of contrastive learning on fine-grained, multi-class, real-world Japanese clinical labels; prior work was confined to English-language data and a reduced set of coarse-grained classes. On a 98-class Japanese ECG classification task, the approach achieves accuracy comparable to that reported in previous English-language studies, which suggests the framework can carry over to non-English settings with heterogeneous clinical labels.
📝 Abstract
The electrocardiogram (ECG) is a fundamental tool in cardiovascular diagnostics because it is both informative and non-invasive. One of its most critical uses is to determine whether more detailed examinations are necessary, and it is read by users with widely varying levels of expertise. Given this diversity, it is essential to help users avoid critical errors. Recent machine learning studies have addressed this challenge by extracting valuable information from ECG data. Using language models, these studies have built multimodal models that classify ECGs according to labeled terms. However, they reduced the number of classes, and it remains uncertain whether the technique is effective for languages other than English. To move towards practical application, we used ECG data from regular patients visiting hospitals in Japan and retained a large number of Japanese labels obtained from actual ECG readings. Using a contrastive learning framework, we found that even with 98 labels for classification, our Japanese-based language model achieves accuracy comparable to previous research. This study extends the applicability of multimodal machine learning frameworks to broader clinical studies and non-English languages.
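The abstract does not spell out the training objective, so the following is only a minimal sketch of what a contrastive ECG-text setup commonly looks like, assuming a CLIP-style symmetric InfoNCE loss. The function names, the temperature of 0.07, and the embedding size are illustrative assumptions, not details from the paper; the encoders themselves are replaced by pre-computed embedding arrays.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(ecg_emb, text_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of ecg_emb is paired with row i of text_emb.

    ecg_emb, text_emb: arrays of shape (batch, dim) from the two encoders.
    Note: this treats every other item in the batch as a negative, which is
    only approximate when two ECGs in the batch share the same label text.
    """
    logits = l2_normalize(ecg_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_ecg_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_ecg = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ecg_to_text + loss_text_to_ecg)

def classify(ecg_emb, label_emb):
    """Assign each ECG to the label whose text embedding is most similar."""
    sims = l2_normalize(ecg_emb) @ l2_normalize(label_emb).T
    return sims.argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs: 98 label embeddings, 8 ECG embeddings.
    label_emb = rng.normal(size=(98, 64))
    ecg_emb = rng.normal(size=(8, 64))
    print(classify(ecg_emb, label_emb).shape)  # one predicted label index per ECG
```

Under these assumptions, classification over the 98 labels reduces to a nearest-neighbour lookup between an ECG embedding and the embeddings of the Japanese label texts, which is why the label set does not have to be collapsed into a few coarse classes.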