Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the feasibility and generalizability of pre-trained audio models—specifically Audio-MAE and the PANN series—for COVID-19 detection from respiratory/cough sounds, addressing the critical challenge of spurious correlations between demographic attributes (e.g., age, sex) and disease status. To mitigate confounding bias, we propose a rigorous stratified evaluation framework: samples are stratified by age and sex on the Coswara and COUGHVID datasets; spectrogram-based fine-tuning and cross-dataset validation are employed, with performance quantified via AUC and F1-score. Results show Audio-MAE achieves 0.82 AUC and 0.76 F1 on Coswara, but performance degrades substantially on COUGHVID and in cross-dataset settings—revealing that post-stratification sample scarcity, not model architecture, fundamentally limits generalizability. This work is the first to systematically expose demographic bias risks and small-sample bottlenecks in audio-based COVID-19 detection, establishing a methodological benchmark for clinically trustworthy deployment.
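The stratified evaluation framework described above can be sketched as a per-stratum class-balancing step. This is a minimal illustration only; the field names, the age-band granularity, and the rule of downsampling to the minority class within each (age, sex) stratum are assumptions, not the authors' exact pipeline:

```python
import random
from collections import defaultdict

def balance_by_demographics(samples, seed=0):
    """Downsample so each (age_band, sex) stratum contains equal numbers of
    positive and negative samples, removing demographic shortcuts that a
    model could otherwise exploit.

    `samples` is a list of dicts with keys 'age_band', 'sex', 'label' (0/1).
    """
    rng = random.Random(seed)
    strata = defaultdict(lambda: {0: [], 1: []})
    for s in samples:
        strata[(s['age_band'], s['sex'])][s['label']].append(s)

    balanced = []
    for groups in strata.values():
        # Cap both classes at the size of the minority class in this stratum.
        n = min(len(groups[0]), len(groups[1]))
        for label in (0, 1):
            balanced.extend(rng.sample(groups[label], n))
    return balanced
```

After balancing, COVID-19 status is statistically independent of age band and sex within the retained set, which is what makes any remaining discriminative signal attributable to the audio itself rather than to demographics.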

📝 Abstract
This technical report investigates the performance of pre-trained audio models on COVID-19 detection tasks using established benchmark datasets. We fine-tuned Audio-MAE and three PANN architectures (CNN6, CNN10, CNN14) on the Coswara and COUGHVID datasets, evaluating both intra-dataset and cross-dataset generalization. We implemented strict demographic stratification by age and gender to prevent models from exploiting spurious correlations between demographic characteristics and COVID-19 status. Intra-dataset results showed moderate performance, with Audio-MAE achieving the strongest result on Coswara (0.82 AUC, 0.76 F1-score), while all models demonstrated limited performance on COUGHVID (AUC 0.58-0.63). Cross-dataset evaluation revealed severe generalization failure across all models (AUC 0.43-0.68), with Audio-MAE showing severe performance degradation (F1-score 0.00-0.08). Our experiments demonstrate that demographic balancing, while reducing apparent model performance, provides a more realistic assessment of COVID-19 detection capabilities by eliminating demographic leakage, a confounding factor that inflates performance metrics. Additionally, the limited dataset sizes after balancing (1,219-2,160 samples) proved insufficient for deep learning models, which typically require substantially larger training sets. These findings highlight fundamental challenges in developing generalizable audio-based COVID-19 detection systems and underscore the importance of rigorous demographic controls for clinically robust model evaluation.
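The AUC figures quoted in the abstract can be computed without any ML library as the Mann-Whitney rank statistic: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half a win. A self-contained sketch:

```python
def auc_score(labels, scores):
    """Rank-based AUC (Mann-Whitney U / (n_pos * n_neg)).

    `labels` are 0/1 ground-truth values; `scores` are model outputs,
    where higher means more likely positive.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each positive/negative pair contributes 1 if ranked correctly,
    # 0.5 on a tie, 0 otherwise.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On this reading, the report's cross-dataset AUCs of 0.43-0.68 sit close to 0.5, i.e., close to a scorer that ranks positives above negatives no better than chance.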
Problem

Research questions and friction points this paper is trying to address.

Evaluating pre-trained audio models for COVID-19 detection using benchmark datasets
Assessing cross-dataset generalization and demographic bias in COVID detection
Investigating performance limitations of audio-based COVID detection systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned pre-trained audio models for COVID detection
Implemented demographic stratification to prevent spurious correlations
Evaluated cross-dataset generalization revealing performance limitations
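The cross-dataset protocol listed above (fit on one corpus, evaluate on the other) can be outlined with a stand-in scorer. The single-threshold "model" here is purely illustrative, not the fine-tuned audio networks used in the report; only the train-on-A / test-on-B structure mirrors the evaluation:

```python
def f1_score(y_true, y_pred):
    """F1 from 0/1 predictions: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def fit_threshold(scores, labels):
    """Choose the decision cutoff that maximizes F1 on the training corpus."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(scores)):
        f1 = f1_score(labels, [int(s >= t) for s in scores])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

def cross_dataset_f1(train, test):
    """Fit on dataset A, report F1 on dataset B: the generalization test."""
    t = fit_threshold(*train)
    scores, labels = test
    return f1_score(labels, [int(s >= t) for s in scores])
```

A large gap between intra-dataset F1 and `cross_dataset_f1` is exactly the failure mode the report documents (F1 dropping to 0.00-0.08 for Audio-MAE across corpora).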