🤖 AI Summary
Existing remote photoplethysmography (rPPG) methods perform well under controlled, ideal lighting but exhibit severe robustness degradation in real-world outdoor scenarios (e.g., intense sunlight, dynamic shadows, and motion-induced occlusions). To address this, we propose the first lightweight, end-to-end video Transformer tailored for natural outdoor environments. Our method introduces three core innovations: (1) a global interference sharing mechanism to jointly model and suppress heterogeneous disturbances; (2) subject-background reference modeling to disentangle physiological signals from contextual artifacts; and (3) self-supervised representation disentanglement to enhance physiological specificity. Further, we integrate spatiotemporal filtering, reconstruction-guided learning, frequency-domain constraints, and hemodynamic biophysical priors to achieve high-fidelity extraction of subtle cardiac pulsations. Evaluated on a multi-source, cross-scenario benchmark, our approach surpasses state-of-the-art methods by 12.6% in accuracy, supports efficient edge deployment, and demonstrates strong generalization and practical utility.
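The hemodynamic biophysical prior mentioned above is commonly realized as a heart-rate band limit: plausible human heart rates span roughly 42–240 bpm, i.e., 0.7–4.0 Hz. As a minimal illustrative sketch (not the paper's actual module; the function name, band edges, and FFT-masking approach are our assumptions), frequencies outside that band in a raw facial color trace can be treated as illumination or motion interference and suppressed:

```python
import numpy as np

def bandpass_pulse(trace, fs, low=0.7, high=4.0):
    """Suppress spectral content outside the plausible heart-rate band.

    Illustrative FFT-mask band-pass: components below `low` Hz (slow
    illumination drift) and above `high` Hz (flicker, sensor noise) are
    zeroed, keeping only physiologically plausible pulse frequencies.
    """
    trace = np.asarray(trace, dtype=float)
    spec = np.fft.rfft(trace - trace.mean())          # remove DC, go to frequency domain
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fs)   # bin frequencies in Hz
    spec[(freqs < low) | (freqs > high)] = 0.0        # hard mask outside the HR band
    return np.fft.irfft(spec, n=len(trace))           # back to a time-domain pulse trace
```

For example, a 1.2 Hz pulse buried under a strong 0.1 Hz illumination drift is recovered almost exactly, since the drift falls entirely below the band.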
📄 Abstract
Physiological activity manifests as subtle changes in facial imaging. While these changes are barely perceptible to the human eye, computer vision methods can detect them, and the resulting remote photoplethysmography (rPPG) has shown considerable promise. However, existing studies rely mainly on spatial skin recognition and temporal rhythmic interactions: they focus on identifying explicit features under ideal lighting conditions, but perform poorly in the wild under intricate occlusions and extreme illumination. In this paper, we propose an end-to-end video Transformer model for rPPG. It strives to eliminate complex and unknown external time-varying interferences, whether they are strong enough to swamp the subtle biosignal amplitude or appear as periodic perturbations that hinder network training. Specifically, we employ global interference sharing, subject background referencing, and self-supervised disentanglement to suppress interference, and further guide learning with spatiotemporal filtering, reconstruction guidance, and frequency-domain and biological-prior constraints to achieve effective rPPG. To the best of our knowledge, this is the first robust rPPG model for real outdoor scenarios based on natural face videos, and it is lightweight to deploy. Extensive experiments demonstrate the competitive performance of our model in rPPG prediction across datasets and scenes.
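The frequency-domain constraint described above can be pictured as a penalty on spectral energy outside the heart-rate band. The sketch below is our own illustration, not the paper's loss function: it computes the fraction of out-of-band power in a predicted pulse signal (in actual training this would be expressed with differentiable tensor ops, e.g. `torch.fft`, rather than numpy):

```python
import numpy as np

def frequency_band_loss(signal, fs, band=(0.7, 4.0)):
    """Fraction of spectral power outside the plausible heart-rate band.

    A generic frequency-domain constraint for rPPG: real heart rates lie
    roughly in 42-240 bpm (0.7-4.0 Hz), so a well-formed predicted pulse
    should concentrate its power inside that band. Returns a value in
    [0, 1]; 0 means all power is in-band.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                    # drop the DC component
    power = np.abs(np.fft.rfft(signal)) ** 2           # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    total = power.sum() + 1e-8                         # avoid division by zero
    return float(power[~in_band].sum() / total)
```

A clean 1.2 Hz sinusoid sampled at 30 fps yields a near-zero loss, while adding an out-of-band 8 Hz component raises it, which is the behavior such a constraint exploits during training.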