RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification

📅 2026-06-08

🤖 AI Summary

This study addresses the limited generalization and diagnostic accuracy of existing unimodal models that rely solely on respiratory sounds. The authors propose RespiraMFM, a multimodal foundation model that integrates respiratory audio with textual clinical histories and symptom descriptions. The model uses audio-text contrastive learning to align the two modalities and is evaluated under both supervised fine-tuning and zero-shot inference. Across five major respiratory diseases and seven real-world datasets, it reports AUROC gains over existing baselines of 9.15% on supervised tasks and 20.98% on zero-shot tasks. The authors take these results as evidence that multimodal fusion with contrastive alignment improves both accuracy and generalization.
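The summary does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric, CLIP-style InfoNCE loss over a batch of paired audio and text embeddings. The function names, the temperature value of 0.07, and the use of plain Python lists are all assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Hypothetical audio-text alignment loss (assumed form, not the paper's).

    audio_emb[i] and text_emb[i] are embeddings of the same patient record;
    every other pairing in the batch is treated as a negative.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    text = [l2_normalize(t) for t in text_emb]
    # Similarity matrix: rows index audio clips, columns index text records.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

With toy embeddings, correctly paired audio and text should give a much lower loss than the same batch with the text records swapped, which is the behavior the alignment objective is meant to reward.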

📝 Abstract

Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.
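The abstract does not describe how zero-shot predictions are produced. One plausible reading, sketched below under that assumption, is that an audio embedding is compared against text embeddings of label prompts in the shared space and the similarities are turned into probabilities with a softmax. The label names, the toy two-dimensional embeddings, and the temperature are hypothetical and stand in for the outputs of the model's encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(audio_emb, prompt_embs, temperature=0.07):
    """Softmax over similarities between one audio embedding and each label prompt.

    prompt_embs maps a label name to the (assumed) text embedding of a prompt
    describing that label; no labeled audio for the label is required.
    """
    labels = list(prompt_embs)
    logits = [cosine(audio_emb, prompt_embs[label]) / temperature for label in labels]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy example: placeholder embeddings, not real encoder outputs.
prompts = {
    "disease present": [1.0, 0.0],
    "disease absent": [0.0, 1.0],
}
scores = zero_shot_scores([0.9, 0.1], prompts)
prediction = max(scores, key=scores.get)
```

In this sketch the quality of zero-shot predictions depends entirely on how well the contrastive stage has aligned audio and text, which is consistent with the abstract reporting the larger AUROC gain in the zero-shot setting.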

Problem

Research questions and friction points this paper is trying to address.

respiratory disease identification

multimodal learning

audio-language alignment

diagnostic accuracy

generalizability

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal foundation model

contrastive audio-language alignment

respiratory disease identification