🤖 AI Summary
General-purpose audio foundation models (e.g., AST, PaSST, BEATs) lack systematic evaluation for clinical auscultation tasks such as heart and lung sound analysis. Method: We conduct the first cross-task benchmarking of these models on four clinically relevant tasks (heart sound classification, respiratory sound classification, abnormal breath detection, and crackle/wheeze identification) using linear probing and lightweight fine-tuning on public datasets. Performance is compared against state-of-the-art (SOTA) domain-specific models under zero-shot and fine-tuned settings. Contribution/Results: On tasks with clean data, foundation models match or exceed SOTA accuracy, outperforming dedicated respiratory sound models by up to 8.2% absolute accuracy. However, they generalize poorly under high-noise conditions. We release a standardized evaluation protocol, a unified codebase, and reproducible benchmarks to advance rigorous, clinically grounded assessment of medical audio foundation models.
📝 Abstract
Pre-trained deep learning models, known as foundation models, have become essential building blocks in machine learning domains such as natural language processing and image processing. This trend has extended to respiratory and heart sound models, which have demonstrated effectiveness as off-the-shelf feature extractors. However, their benchmarking has been limited, and the reported results are not directly comparable with state-of-the-art (SOTA) performance, which hinders proof of their effectiveness. This study investigates the practical effectiveness of off-the-shelf audio foundation models by comparing their performance on four respiratory and heart sound tasks with SOTA fine-tuning results. Experiments show that the models struggled on two tasks with noisy data but achieved SOTA performance on the other tasks with clean data. Moreover, general-purpose audio models outperformed a respiratory sound model, highlighting their broader applicability. With the insights gained and the released code, we contribute to future research on developing and leveraging foundation models for respiratory and heart sounds.
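As a rough illustration of what using a foundation model as an off-the-shelf feature extractor can look like, the sketch below fits a linear classifier on fixed embeddings. It is not the evaluation code released with this study: the embeddings are synthetic stand-ins for the output of a frozen audio model, the task is reduced to two classes, and the helper names (`fake_embeddings`, `train_linear_probe`, `predict`) are hypothetical.

```python
import math
import random


def fake_embeddings(n, dim, seed):
    """Synthetic stand-in for embeddings from a frozen audio model (assumption).

    Class 0 is centred at -1 and class 1 at +1 in every dimension.
    """
    rng = random.Random(seed)
    feats, labels = [], []
    for i in range(n):
        y = i % 2
        center = 1.0 if y else -1.0
        feats.append([rng.gauss(center, 0.5) for _ in range(dim)])
        labels.append(y)
    return feats, labels


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression with batch gradient descent.

    Only the weights w and bias b are learned; the embeddings stay fixed.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one embedding."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


if __name__ == "__main__":
    train_x, train_y = fake_embeddings(40, 8, seed=0)
    test_x, test_y = fake_embeddings(40, 8, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    correct = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y))
    print(f"held-out accuracy: {correct / len(test_y):.2f}")
```

In a real evaluation, `fake_embeddings` would be replaced by embeddings computed once from the recordings with the pre-trained model, and the classifier would be trained and scored on each dataset's own train/test split.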