🤖 AI Summary
This work addresses the degradation in classification performance of physiological signal models under real-world deployment conditions caused by sensor noise, motion artifacts, and distribution shifts between training and deployment data. To improve robustness without retraining the model, the authors propose a unified inference-time augmentation (ITA) framework that integrates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations. Hyperparameters are tuned automatically with Bayesian optimization, yielding a general, model-agnostic enhancement scheme. Evaluated on atrial fibrillation (AF) detection from photoplethysmography (PPG) signals, the method improves AUROC by up to 8.5% and AUPRC by up to 10.6%, while selective ITA reduces the average false positive rate on non-AF datasets by up to 4.4%.
📝 Abstract
Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap.
Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second photoplethysmography (PPG) signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and approximately 9,800 hours of recordings.
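As an illustration of the general ITA idea described above, the following is a minimal sketch rather than the authors' implementation. It assumes predictions from the original and augmented views are aggregated by simple averaging, which the abstract does not specify. The function names, the three example augmentations, their parameter values, and `toy_model` are all hypothetical; in the framework described here, such parameters would be tuned by Bayesian optimization rather than fixed by hand.

```python
import math
import random
from typing import Callable, List, Sequence

Signal = List[float]
Augmentation = Callable[[Sequence[float]], Signal]


def time_shift(x: Sequence[float], shift: int) -> Signal:
    """Time-domain example: circularly shift the signal by `shift` samples."""
    if not x:
        return []
    k = shift % len(x)
    return list(x[-k:]) + list(x[:-k]) if k else list(x)


def amplitude_scale(x: Sequence[float], factor: float) -> Signal:
    """Amplitude-domain example: multiply every sample by `factor`."""
    return [factor * v for v in x]


def add_gaussian_noise(x: Sequence[float], sigma: float, rng: random.Random) -> Signal:
    """Noise example: add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def make_augmentations(rng: random.Random) -> List[Augmentation]:
    """Build a small augmentation set with illustrative, hand-picked parameters."""
    return [
        lambda s: time_shift(s, 25),
        lambda s: amplitude_scale(s, 1.1),
        lambda s: add_gaussian_noise(s, 0.05, rng),
    ]


def ita_predict(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    augmentations: Sequence[Augmentation],
    include_original: bool = True,
) -> float:
    """Average the model's probability over the original and augmented views.

    Averaging is an assumption of this sketch; the model is used as-is,
    so no retraining is involved.
    """
    views: List[Signal] = [list(x)] if include_original else []
    views += [aug(x) for aug in augmentations]
    if not views:
        raise ValueError("at least one view is required")
    probs = [model(v) for v in views]
    return sum(probs) / len(probs)


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical stand-in classifier: logistic function of mean signal energy."""
    energy = sum(v * v for v in x) / len(x)
    return 1.0 / (1.0 + math.exp(-(energy - 0.5)))


if __name__ == "__main__":
    rng = random.Random(0)
    signal = [math.sin(2 * math.pi * i / 50) for i in range(200)]
    print("single-pass probability:", toy_model(signal))
    print("ITA probability:", ita_predict(toy_model, signal, make_augmentations(rng)))
```

Because `ita_predict` only calls the model on transformed inputs, any trained classifier that maps a signal to a probability could be substituted for `toy_model`, which is what makes this kind of scheme model-agnostic.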
Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets.
Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.