🤖 AI Summary
Existing research lacks a multimodal video dataset specifically designed for ambivalence/hesitancy (A/H) emotion recognition. Method: We introduce BAH—the first open-source, multimodal video dataset tailored to behaviour change contexts—comprising 1,118 videos (8.26 hours, of which 1.5 hours contain A/H) from 224 Canadian participants, with dual-granularity (video-level and frame-level) A/H annotations. Our pipeline integrates facial cropping and alignment, speech transcription, and timestamped behavioural labeling to jointly model video, audio, text, and metadata. It supports cross-modal learning, zero-shot transfer, and personalization via unsupervised domain adaptation. Contribution/Results: Baseline experiments reveal substantial challenges in real-world A/H recognition; we publicly release the full dataset, code, and pretrained models—establishing foundational resources for interpretable, behaviour-intervention AI systems.
📝 Abstract
Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions manifest as discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, relying on them within digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly in real-time, resource-limited environments. However, no datasets are currently available for designing ML models to recognize A/H. This paper introduces the first Behavioural Ambivalence/Hesitancy (BAH) dataset, collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants captured across 9 provinces in Canada, spanning a range of ages and ethnicities. Through our web platform, we recruited participants to answer 7 questions, some designed to elicit A/H, while recording themselves via webcam and microphone. BAH amounts to 1,118 videos with a total duration of 8.26 hours, of which 1.5 hours contain A/H. Our behavioural team annotated timestamped segments indicating where A/H occurs, providing frame- and video-level annotations along with the corresponding A/H cues. Video transcripts with timestamps are also included, along with cropped and aligned faces for each frame and a variety of participant metadata. We report baseline results for BAH on frame- and video-level recognition in multimodal setups, as well as for zero-shot prediction and for personalization using unsupervised domain adaptation. The limited performance of baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are publicly available.
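The dual-granularity labels described above (timestamped A/H segments yielding both frame-level and video-level annotations) can be sketched as follows. This is a hypothetical illustration, not the actual BAH schema: the `Segment` fields, the frame rate, and the any-frame-positive aggregation rule are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A hypothetical annotated A/H interval, in seconds."""
    start_s: float
    end_s: float

def frame_labels(segments, n_frames, fps=30.0):
    """Binary per-frame A/H labels: 1 if the frame's timestamp
    falls inside any annotated segment (assumed convention)."""
    labels = []
    for i in range(n_frames):
        t = i / fps
        labels.append(int(any(s.start_s <= t < s.end_s for s in segments)))
    return labels

def video_label(frame_level):
    """Video-level label: A/H present if any frame is positive
    (one plausible aggregation rule, assumed here)."""
    return int(any(frame_level))

# A 3-second clip at 30 fps with one annotated second of A/H.
segs = [Segment(1.0, 2.0)]
fl = frame_labels(segs, n_frames=90, fps=30.0)
print(sum(fl), video_label(fl))  # → 30 1
```

In practice a released dataset would ship such labels precomputed; the sketch only shows how segment annotations relate the two label granularities.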