🤖 AI Summary
Traditional domain-specific models for electronic fetal monitoring (EFM/CTG) analysis suffer from poor generalizability and strong data dependency. Method: We propose the first unified benchmark framework encompassing more than 15 models (domain-specific, time-series, foundation, and large language models), evaluated on a dataset of over 2,500 twenty-minute CTG recordings under varying data-availability conditions. Contribution/Results: Fine-tuned LLMs consistently outperform state-of-the-art foundation and domain-specific models in CTG classification, except when uterine-activity signals are absent, where domain-specific models prove more robust; however, the LLMs incur substantially higher computational overhead, so deployment requires a trade-off between accuracy and computational efficiency. This work provides the first empirical validation of LLMs' potential in antepartum fetal monitoring and establishes a reproducible benchmark and methodological foundation for medical time-series foundation modeling.
📝 Abstract
Foundation models (FMs) and large language models (LLMs) have demonstrated promising generalization across diverse domains for time-series analysis, yet their potential for electronic fetal monitoring (EFM) and cardiotocography (CTG) analysis remains underexplored. Most existing CTG studies have relied on domain-specific models and lack systematic comparisons with modern foundation or language models, limiting our understanding of whether these models can outperform specialized systems in fetal health assessment. In this study, we present the first comprehensive benchmark of state-of-the-art architectures for automated antepartum CTG classification. We used over 2,500 20-minute recordings to evaluate more than 15 models spanning domain-specific, time-series, foundation, and language-model categories under a unified framework. Fine-tuned LLMs consistently outperformed both foundation and domain-specific models across data-availability scenarios, except when uterine-activity signals were absent, where domain-specific models showed greater robustness. These performance gains, however, required substantially higher computational resources. Our results highlight that, although fine-tuned LLMs achieved state-of-the-art performance for CTG classification, practical deployment must balance performance against computational efficiency.