Diffusion-Based Heart Sound Generation: Evaluation with Physiological Signal Metrics, Classifiers, and Expert Listening

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limitations of publicly available phonocardiogram (PCG) datasets, whose small scale and limited pathological diversity hinder both auscultation training and the generalization of automated classification models. The authors develop a class-conditional diffusion model for PCG synthesis, training a conditional 2D U-Net denoiser with classifier-free guidance on normalized 1×128×128 log-mel spectrograms of 4 s clips. The generated signals are evaluated along three dimensions: physiology-inspired plausibility metrics (an envelope-autocorrelation rhythm score, an amplitude-based explosion score, and the dominant cycle lag), downstream classification, and a pilot expert listening test. Synthetic clips preserve dominant cycle durations similar to those of real clips but show reduced envelope periodicity and increased transient burstiness. A ResNet-50 classifier reaches 92.24% accuracy on the held-out real test set and 82.8% on class-balanced synthetic batches, indicating that the generated signals retain discriminative structure between normal and abnormal heart sounds. In the listening study (60 clips, two clinicians), most synthetic clips were judged heart-sound-like, although abnormality sensitivity was low for both real and synthetic excerpts.
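To make the classifier-free guidance step concrete, here is a minimal sketch of one common formulation, not the authors' implementation. The function names, the `guidance_scale` and `p_drop` parameters, and the null-label convention are all assumptions for illustration; the summary does not state the paper's guidance weight or label-dropout rate.

```python
import numpy as np

def drop_labels(labels, null_label, p_drop, rng):
    """Training side: swap a random fraction of class labels for a null token
    so that one network learns both conditional and unconditional denoising.
    `null_label` and `p_drop` are hypothetical parameters."""
    mask = rng.random(labels.shape[0]) < p_drop
    return np.where(mask, null_label, labels)

def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Sampling side: extrapolate from the unconditional noise prediction
    towards the conditional one. A scale of 1 returns the conditional
    prediction unchanged; a scale of 0 returns the unconditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-in noise predictions with the 1 x 128 x 128 spectrogram shape.
rng = np.random.default_rng(0)
eps_c = rng.standard_normal((2, 1, 128, 128))
eps_u = rng.standard_normal((2, 1, 128, 128))
guided = cfg_noise(eps_c, eps_u, guidance_scale=3.0)  # 3.0 is an arbitrary example value
```

In a real sampler, `eps_c` and `eps_u` would come from two forward passes of the same U-Net, one with the class label (normal or abnormal) and one with the null label.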

📝 Abstract

Publicly available phonocardiogram (PCG) datasets remain limited in size and pathological diversity, constraining both auscultation training and the generalisation of automated heart-sound classifiers. A class-conditional diffusion model for PCG generation is developed in the log-mel domain and synthetic fidelity is assessed using complementary (i) physiology-inspired plausibility metrics, (ii) downstream label-consistency evaluation, and (iii) expert listening. Experiments use the PhysioNet/Computing in Cardiology Challenge 2016 dataset (3240 recordings) with recording-level splits. After preprocessing and quality control, 16,749 non-overlapping 4 s clips are mapped to a normalised 1 × 128 × 128 log-mel representation to train a conditional 2D U-Net denoiser with classifier-free guidance. Signal-level plausibility is quantified on reconstructed waveforms using three lightweight metrics: an envelope-autocorrelation rhythm score, an amplitude-based explosion score, and the dominant cycle lag. Synthetic clips preserve similar dominant cycle durations but exhibit reduced envelope periodicity and increased transient burstiness relative to real clips. For downstream evaluation, a ResNet-50 classifier achieves 92.24% accuracy on the held-out real test set and 82.8% accuracy on class-balanced synthetic batches, indicating that generated signals retain discriminative structure relevant to normal/abnormal classification. In a pilot expert listening study (60 clips, two clinicians), most synthetic clips are judged as heart-sound-like, while abnormality sensitivity is low for both real and synthetic 4 s excerpts. Overall, the results provide a practical baseline for diffusion-based PCG generation while highlighting remaining challenges in retaining abnormal acoustic cues and reducing reconstruction-induced artefacts.
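The envelope-autocorrelation rhythm score and dominant cycle lag named in the abstract could be computed roughly as follows. This is a sketch under stated assumptions, since the abstract does not give the exact definitions: the moving-average envelope, the 50 ms smoothing window, the 0.3 to 2.0 s lag search range, and the function names are all hypothetical choices made here for illustration.

```python
import numpy as np

def envelope(x, sr, win_s=0.05):
    """Crude amplitude envelope: moving average of |x| (assumed 50 ms window)."""
    win = max(1, int(win_s * sr))
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def rhythm_score_and_lag(x, sr, min_s=0.3, max_s=2.0):
    """Return (rhythm score, dominant cycle lag in seconds).

    The score is the largest normalised autocorrelation of the mean-removed
    envelope inside an assumed plausible cardiac-cycle range; the lag is
    where that maximum occurs."""
    env = envelope(x, sr)
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]  # lags 0..N-1
    lo, hi = int(min_s * sr), min(int(max_s * sr), len(ac) - 1)
    if ac[0] <= 0 or hi <= lo:
        return 0.0, 0.0  # silent or too-short clip: no rhythm estimate
    ac = ac / ac[0]
    k = lo + int(np.argmax(ac[lo:hi + 1]))
    return float(ac[k]), k / sr

# Toy 4 s clip: Gaussian bursts every 0.8 s on a 50 Hz carrier (not real PCG).
sr = 1000  # assumed sampling rate for the toy signal
t = np.arange(4 * sr) / sr
bursts = sum(np.exp(-0.5 * ((t - c) / 0.03) ** 2) for c in (0.2, 1.0, 1.8, 2.6, 3.4))
score, lag = rhythm_score_and_lag(bursts * np.sin(2 * np.pi * 50 * t), sr)
```

Under this definition, a clip with weaker beat-to-beat regularity yields a lower score even when the lag is unchanged, which matches the pattern the abstract reports for synthetic clips: similar dominant cycle durations but reduced envelope periodicity.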

Problem

Research questions and friction points this paper is trying to address.

phonocardiogram

data scarcity

pathological diversity

heart sound classification

auscultation training

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model

phonocardiogram synthesis

log-mel spectrogram