🤖 AI Summary
This work addresses the challenge of recognizing complex behavioral states such as ambivalence and hesitancy (A/H) in naturalistic videos, which manifest as subtle multimodal conflicts. The authors propose a regularized multimodal fusion framework that combines visual, acoustic, and linguistic features with a statistical text modality designed to capture the temporal dynamics of speech. Heterogeneous classifiers are selected by validation binary cross-entropy loss and then fused in a hard-voting ensemble optimized via particle swarm optimization (PSO), with a penalty term (λ=0.2) on the training-validation gap to mitigate overfitting. Experimental results show that the linguistic modality is the strongest individual predictor, and the proposed approach achieves a Macro F1-score of 0.7465 on an unseen test set.
📝 Abstract
Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require contextual and temporal understanding. In this paper, we propose a regularized multimodal fusion pipeline to predict A/H at the video level. We extract unimodal features from visual, acoustic, and linguistic data, and introduce a statistical text modality designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT) and select the best-calibrated models by validation Binary Cross-Entropy (BCE) loss. To fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function includes a penalty, weighted by λ, on the train-validation gap to suppress redundant or overfitted classifiers. Our evaluation shows that while linguistic features are the strongest independent predictor of A/H, the regularized PSO ensemble (λ = 0.2) exploits multimodal synergies and achieves a peak Macro F1-score of 0.7465 on the unseen test set. These results suggest that treating ambivalence and hesitancy as a multimodal conflict, evaluated through a weighted committee, provides a robust framework for in-the-wild behavioral analysis.
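The abstract does not give the exact form of the PSO fitness function, so the following is only a minimal sketch of one plausible reading: each selected classifier gets a non-negative vote weight, the ensemble prediction is a weighted majority vote, and the fitness is the validation Macro F1 minus lambda times the absolute train-validation gap. The function names (`macro_f1`, `weighted_hard_vote`, `fitness`, `pso`), the form of the penalty, the tie-breaking rule, and all PSO hyperparameters other than lambda = 0.2 are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the two classes {0, 1}."""
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def weighted_hard_vote(preds, weights):
    """Weighted majority vote.

    preds:   (n_models, n_samples) array of 0/1 labels.
    weights: (n_models,) array of non-negative vote weights.
    Ties go to class 1 (an arbitrary choice made for this sketch).
    """
    total = weights.sum()
    if total <= 0:
        return np.zeros(preds.shape[1], dtype=int)
    return (weights @ preds >= 0.5 * total).astype(int)

def fitness(weights, train_preds, y_train, val_preds, y_val, lam=0.2):
    """Validation Macro F1 penalized by the train-validation gap (assumed form)."""
    f1_train = macro_f1(y_train, weighted_hard_vote(train_preds, weights))
    f1_val = macro_f1(y_val, weighted_hard_vote(val_preds, weights))
    return f1_val - lam * abs(f1_train - f1_val)

def pso(objective, dim, n_particles=20, n_iters=50, seed=0,
        inertia=0.7, c1=1.5, c2=1.5):
    """Basic global-best PSO that maximizes `objective` over [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    pos = rng.random((n_particles, dim))
    vel = np.zeros_like(pos)
    best_pos = pos.copy()                      # per-particle best positions
    best_val = np.array([objective(p) for p in pos])
    g = best_pos[np.argmax(best_val)].copy()   # global best position
    g_val = best_val.max()
    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)     # keep weights in [0, 1]
        vals = np.array([objective(p) for p in pos])
        improved = vals > best_val
        best_pos[improved] = pos[improved]
        best_val[improved] = vals[improved]
        if best_val.max() > g_val:
            g_val = best_val.max()
            g = best_pos[np.argmax(best_val)].copy()
    return g, float(g_val)
```

Under these assumptions, the ensemble weights would be obtained by calling `pso(lambda w: fitness(w, train_preds, y_train, val_preds, y_val), dim=n_models)`. A classifier whose vote raises the training score more than the validation score widens the gap and lowers the fitness, so the search tends to drive its weight toward zero.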