Large language models surpass domain-specific architectures for antepartum electronic fetal monitoring analysis

📅 2025-09-09

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing electronic fetal monitoring (EFM) and cardiotocography (CTG) studies rely mostly on domain-specific models and lack systematic comparisons with foundation and language models. Method: The paper benchmarks over 15 models spanning domain-specific, time-series, foundation, and language-model categories on more than 2,500 twenty-minute antepartum CTG recordings, under a unified framework and across data-availability scenarios. Contribution/Results: Fine-tuned LLMs consistently outperform both foundation and domain-specific models, except when uterine-activity signals are absent, where domain-specific models are more robust. These gains require substantially higher computational resources, so practical deployment must balance performance against computational efficiency. The authors present this as the first comprehensive benchmark of state-of-the-art architectures for automated antepartum CTG classification.

📝 Abstract

Foundation models (FMs) and large language models (LLMs) have demonstrated promising generalization across diverse domains for time-series analysis, yet their potential for electronic fetal monitoring (EFM) and cardiotocography (CTG) analysis remains underexplored. Most existing CTG studies rely on domain-specific models and lack systematic comparisons with modern foundation or language models, limiting our understanding of whether these models can outperform specialized systems in fetal health assessment. In this study, we present the first comprehensive benchmark of state-of-the-art architectures for automated antepartum CTG classification. Over 2,500 20-minute recordings were used to evaluate over 15 models spanning domain-specific, time-series, foundation, and language-model categories under a unified framework. Fine-tuned LLMs consistently outperformed both foundation and domain-specific models across data-availability scenarios, except when uterine-activity signals were absent, where domain-specific models showed greater robustness. These performance gains, however, required substantially higher computational resources. Our results highlight that while fine-tuned LLMs achieved state-of-the-art performance for CTG classification, practical deployment must balance performance with computational efficiency.

Problem

Research questions and friction points this paper is trying to address.

Evaluating large language models for electronic fetal monitoring analysis

Comparing foundation models with domain-specific CTG classification systems

Assessing computational efficiency trade-offs in fetal health assessment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned LLMs outperform domain-specific CTG models

Comprehensive benchmark systematically evaluates over 15 models under a unified framework

Balancing performance with computational efficiency for deployment
