🤖 AI Summary
Long-term (1–2 minute) multimodal emotion recognition faces challenges in modeling dynamic cross-modal interactions and effectively fusing long video sequences with multi-channel physiological signals (e.g., EDA, ECG/PPG). To address this, we propose MVP, a lightweight attention-driven video-physiology fusion architecture. MVP introduces the first unified deep learning framework integrating a dual-stream CNN-LSTM video encoder, a time-frequency feature extraction network for physiological signals, and a cross-modal alignment module with adaptive weighted fusion. Crucially, MVP enables end-to-end co-optimization of visual and multi-channel physiological representations, substantially enhancing long-sequence modeling capability. Evaluated on standard benchmarks, MVP achieves a 4.2–6.8% absolute accuracy improvement over state-of-the-art methods under the joint video+EDA+ECG/PPG modality. Comprehensive experiments further validate its robustness and generalizability across diverse subjects and recording conditions.
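The cross-modal alignment module with adaptive weighted fusion described above can be pictured with a minimal PyTorch sketch. The module name, layer sizes, and gating scheme here are illustrative assumptions, not the paper's exact design: video tokens attend to physiological tokens, and a learned scalar gate adaptively weights the two modalities.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch of MVP-style cross-modal alignment with
    adaptive weighted fusion; dimensions and gating are assumptions."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Cross-attention: video tokens query the physiological tokens.
        self.align = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Scalar gate in [0, 1] that adaptively weights the two modalities.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, 1), nn.Sigmoid())

    def forward(self, video_feats, physio_feats):
        # video_feats:  (B, Tv, d_model) from the video encoder
        # physio_feats: (B, Tp, d_model) from the time-frequency network
        aligned, _ = self.align(video_feats, physio_feats, physio_feats)
        # Pool over time before gating (mean pooling as a simple choice).
        v = video_feats.mean(dim=1)
        p = aligned.mean(dim=1)
        w = self.gate(torch.cat([v, p], dim=-1))  # (B, 1)
        return w * v + (1 - w) * p                # fused embedding (B, d_model)
```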
📝 Abstract
Human emotions entail a complex set of behavioral, physiological, and cognitive changes. Current state-of-the-art models fuse the behavioral and physiological components using classic machine learning rather than recent deep learning techniques. We propose to fill this gap by designing the Multimodal for Video and Physio (MVP) architecture, streamlined to fuse video and physiological signals. Unlike other approaches, MVP exploits the benefits of attention to enable the use of long input sequences (1–2 minutes). We study video and physiological backbones for processing long sequences and evaluate our method against the state of the art. Our results show that MVP outperforms prior methods for emotion recognition based on facial videos, EDA, and ECG/PPG.
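As a rough illustration of how attention makes 1–2 minute inputs tractable, here is a minimal attention-pooling sketch in PyTorch. The `AttentionPool` module, frame rate, and dimensions are hypothetical, standing in for whatever long-sequence mechanism MVP actually uses: each time step gets a learned relevance score, and the sequence is summarized as a score-weighted sum rather than processed step by step.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Minimal sketch of attention pooling over a long feature sequence;
    an assumed reading of the approach, not the paper's exact mechanism."""

    def __init__(self, d_model=256):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # one relevance score per step

    def forward(self, x):
        # x: (B, T, d_model), e.g. T ~ 1800 steps for 2 minutes at 15 fps
        attn = torch.softmax(self.score(x), dim=1)  # (B, T, 1)
        return (attn * x).sum(dim=1)                # (B, d_model) summary

# Usage: pool per-frame features into one clip-level embedding.
pool = AttentionPool(d_model=256)
frames = torch.randn(8, 1800, 256)   # batch of 8 two-minute clips
clip_embedding = pool(frames)        # shape: (8, 256)
```

Because the scores are computed in parallel over all time steps, this kind of pooling sidesteps the vanishing-memory problem that makes purely recurrent models struggle on minute-long sequences.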