🤖 AI Summary
This study addresses the automatic localization of cough onsets and offsets for large-scale tuberculosis (TB) screening, proposing an end-to-end cough detection method based on a pretrained audio Transformer. Using only the first three layers of the XLS-R model, the approach segments coughs in real-world patient audio collected at community-level care centres in South Africa and Uganda, while reducing computational and memory requirements, which matters for smartphone deployment. On the test set, XLS-R attains an average precision of 0.96 and a ROC-AUC of 0.99, outperforming the AST and logistic regression baselines by 9% and 27% absolute in average precision, respectively. Moreover, a TB classifier trained on coughs automatically isolated by XLS-R is only narrowly outperformed by one trained on manually annotated (ground-truth) coughs, indicating that integrating automatic cough segmentation into a TB screening tool is feasible.
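To make "localizing the onset and offset" concrete, the following is a minimal sketch of one common way to turn per-frame cough probabilities into start and end times. It is not the authors' procedure: the function name `frames_to_segments`, the 0.5 threshold, the 20 ms frame hop, and the minimum-duration filter are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(
    probs: Sequence[float],
    frame_hop_s: float = 0.02,    # assumed hop between frames, in seconds
    threshold: float = 0.5,       # assumed decision threshold
    min_duration_s: float = 0.0,  # assumed filter for very short detections
) -> List[Tuple[float, float]]:
    """Convert per-frame cough probabilities into (start, end) times in seconds."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the currently open segment
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # onset: first frame at or above the threshold
        elif not active and start is not None:
            # offset: first frame back below the threshold
            segments.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    if start is not None:
        # the recording ended while a segment was still open
        segments.append((start * frame_hop_s, len(probs) * frame_hop_s))
    # drop segments shorter than the minimum duration
    return [(s, e) for s, e in segments if e - s >= min_duration_s]


if __name__ == "__main__":
    example = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
    for onset, offset in frames_to_segments(example):
        print(f"cough from {onset:.2f} s to {offset:.2f} s")
```

In practice the threshold and minimum duration would be tuned on held-out data, and neighbouring segments separated by very short gaps are often merged.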
📝 Abstract
The automatic identification of cough segments in audio through the determination of start and end points is pivotal to building scalable screening tools in health technologies for pulmonary diseases. We propose the application of two current pre-trained architectures to the task of cough activity detection. A dataset of recordings containing coughs from patients symptomatic for tuberculosis (TB) who self-present at community-level care centres in South Africa and Uganda is employed. When automatic start and end points are determined using XLS-R, an average precision of 0.96 and an area under the receiver operating characteristic curve (ROC-AUC) of 0.99 are achieved for the test set. We show that the best average precision is achieved by utilising only the first three layers of the network, which has the dual benefit of reducing computational and memory requirements, both important for smartphone-based applications. This XLS-R configuration is shown to outperform an audio spectrogram transformer (AST) as well as a logistic regression baseline by 9% and 27% absolute in test set average precision, respectively. Furthermore, a downstream TB classification model trained using the coughs automatically isolated by XLS-R comfortably outperforms a model trained on the coughs isolated by AST, and is only narrowly outperformed by a classifier trained on the ground truth coughs. We conclude that the application of large pre-trained transformer models is an effective approach to identifying cough end-points and that the integration of such a model into a screening tool is feasible.
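For readers unfamiliar with the two reported metrics, the sketch below computes average precision and ROC-AUC from binary labels and detector scores in plain Python. It assumes frame-level labels (1 for cough, 0 for non-cough) and untied scores for average precision. The abstract does not state the exact evaluation protocol, so the function names and toy data here are purely illustrative.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive example.

    Assumes scores are not tied; with ties the result depends on sort order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    if hits == 0:
        raise ValueError("average precision is undefined without positives")
    return precision_sum / hits


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0]               # toy frame labels
    y_score = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # toy detector outputs
    print(f"AP = {average_precision(y_true, y_score):.3f}")
    print(f"ROC-AUC = {roc_auc(y_true, y_score):.3f}")
```

An "absolute" improvement, as used in the abstract, is a difference between two such scores in percentage points, not a ratio between them.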