BROTHER: Behavioral Recognition Optimized Through Heterogeneous Ensemble Regularization for Ambivalence and Hesitancy

📅 2026-03-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the recognition of ambivalence and hesitancy (A/H) in naturalistic videos, where these states appear as subtle conflicts across modalities. The authors propose a regularized multimodal fusion pipeline that adds a statistical text modality, designed to capture temporal variation in speech, to visual, acoustic, and linguistic features. Candidate models are drawn from a heterogeneous set of classifiers and selected by validation loss. The selected models are then combined in a hard-voting ensemble tuned by Particle Swarm Optimization (PSO), whose fitness function penalizes the train-validation gap (λ = 0.2) to limit overfitting. The linguistic modality is the strongest individual predictor, and the regularized ensemble reaches a Macro F1-score of 0.7465 on an unseen test set.
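The gap-penalized fitness described above can be illustrated with a small sketch. The summary only states that a penalty with λ = 0.2 is applied to the train-validation gap, so the exact form used here (validation Macro F1 minus λ times the absolute difference between train and validation Macro F1) is an assumption, as are the function names `macro_f1`, `weighted_hard_vote`, and `penalized_fitness`. The PSO search loop itself is omitted; this is only the score a particle's weight vector might receive.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the two classes (0 and 1)."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2


def weighted_hard_vote(weights, model_preds):
    """Combine per-model 0/1 labels; predict 1 when the weight voting for
    class 1 exceeds half of the total weight."""
    total = sum(weights)
    n_samples = len(model_preds[0])
    fused = []
    for i in range(n_samples):
        positive = sum(w for w, preds in zip(weights, model_preds) if preds[i] == 1)
        fused.append(1 if positive > total / 2 else 0)
    return fused


def penalized_fitness(weights, train_preds, y_train, val_preds, y_val, lam=0.2):
    """Assumed fitness: validation Macro F1 minus lam * |train F1 - val F1|."""
    f1_train = macro_f1(y_train, weighted_hard_vote(weights, train_preds))
    f1_val = macro_f1(y_val, weighted_hard_vote(weights, val_preds))
    return f1_val - lam * abs(f1_train - f1_val)
```

Under this assumed form, a weight vector that scores perfectly on training data but worse on validation data is ranked below one with the same validation score and a smaller gap, which is the behavior the penalty is meant to encourage.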

📝 Abstract

Recognizing complex behavioral states such as Ambivalence and Hesitancy (A/H) in naturalistic video settings remains a significant challenge in affective computing. Unlike basic facial expressions, A/H manifests as subtle, multimodal conflicts that require deep contextual and temporal understanding. In this paper, we propose a highly regularized, multimodal fusion pipeline to predict A/H at the video level. We extract robust unimodal features from visual, acoustic, and linguistic data, introducing a specialized statistical text modality explicitly designed to capture temporal speech variations and behavioral cues. To identify the most effective representations, we evaluate 15 distinct modality combinations across a committee of machine learning classifiers (MLP, Random Forest, and GBDT), selecting the most well-calibrated models based on validation Binary Cross-Entropy (BCE) loss. Furthermore, to optimally fuse these heterogeneous models without overfitting to the training distribution, we implement a Particle Swarm Optimization (PSO) hard-voting ensemble. The PSO fitness function dynamically incorporates a train-validation gap penalty (lambda) to actively suppress redundant or overfitted classifiers. Our comprehensive evaluation demonstrates that while linguistic features serve as the strongest independent predictor of A/H, our heavily regularized PSO ensemble (lambda = 0.2) effectively harnesses multimodal synergies, achieving a peak Macro F1-score of 0.7465 on the unseen test set. These results emphasize that treating ambivalence and hesitancy as a multimodal conflict, evaluated through an intelligently weighted committee, provides a robust framework for in-the-wild behavioral analysis.
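As a rough illustration of the model selection step, the sketch below enumerates modality combinations and ranks candidate models by validation BCE. It assumes that the 15 combinations are the non-empty subsets of four modalities (visual, acoustic, linguistic, and the statistical text modality), which matches the count 2^4 - 1 = 15 but is not stated explicitly in the abstract. The names `MODALITIES`, `modality_combinations`, `select_by_validation_bce`, and the `top_k` cutoff are hypothetical.

```python
from itertools import combinations
from math import log

# Assumed modality names; the abstract does not list identifiers.
MODALITIES = ("visual", "acoustic", "linguistic", "text_stats")


def modality_combinations(modalities=MODALITIES):
    """All non-empty subsets of the modalities (15 for four modalities)."""
    return [
        combo
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean BCE between 0/1 labels and predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)


def select_by_validation_bce(candidates, y_val, top_k=3):
    """Return the names of the top_k candidates with the lowest validation BCE.

    candidates maps a model name to its predicted probabilities on the
    validation set.
    """
    ranked = sorted(
        candidates,
        key=lambda name: binary_cross_entropy(y_val, candidates[name]),
    )
    return ranked[:top_k]
```

For example, each of the three classifier families could be trained on every combination returned by `modality_combinations()`, with the resulting validation probabilities passed to `select_by_validation_bce` to choose the ensemble members.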

Problem

Research questions and friction points this paper is trying to address.

Ambivalence

Hesitancy

Behavioral Recognition

Multimodal Fusion

Affective Computing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Fusion

Particle Swarm Optimization

Behavioral Recognition