AI Summary
This work addresses the susceptibility of multimodal large language models (MLLMs) to spurious audiovisual cues and hallucinations stemming from overreliance on textual priors in emotion understanding, which often leads to erroneous reasoning. To mitigate these issues, the authors propose AVEm-DPO, a preference optimization method that constructs preference pairs containing either spurious associations or hallucinated responses, augmented with a regularization term to suppress dependence on textual priors and enhance sensitivity to genuine audiovisual emotional signals. The study further introduces EmoReAlM, a novel benchmark designed to quantitatively evaluate cue-emotion alignment and hallucination in multimodal emotion tasks, an aspect previously unaddressed in the literature. Evaluated under zero-shot settings on DFEW, RAVDESS, and EMER datasets, the proposed approach achieves relative performance improvements of 6%–19% over baseline methods.
Abstract
Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain: spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and over audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves on the reference baseline models, with relative performance gains of 6–19% in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models, and the benchmark will be released at https://avere-iclr.github.io.
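To make the shape of such an objective concrete, here is a minimal sketch of a standard DPO loss extended with a text-prior penalty. This is an illustration only: the paper's exact formulation is not given in the abstract, so the regularizer (a penalty on how likely the preferred response remains when audiovisual inputs are withheld) and all function and parameter names (`dpo_loss_with_prior_reg`, `logp_text_only`, `beta`, `lam`) are assumptions, not the authors' definitions.

```python
import math

def dpo_loss_with_prior_reg(logp_chosen, logp_rejected,
                            ref_logp_chosen, ref_logp_rejected,
                            logp_text_only, beta=0.1, lam=0.1):
    """Illustrative DPO loss plus a hypothetical text-prior regularizer.

    All log-probabilities are scalars: policy and frozen-reference
    log-likelihoods of the chosen/rejected responses, and the policy
    log-likelihood of the chosen response given text only (audiovisual
    inputs masked). The regularizer term is an assumption, not the
    paper's exact formulation.
    """
    # Standard DPO margin: log-ratio of chosen vs. rejected responses,
    # each measured relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # DPO loss: -log sigmoid(margin); small when chosen >> rejected.
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # Hypothetical regularizer: penalize responses that stay likely
    # even without audiovisual evidence, i.e. text-prior-driven answers.
    reg = lam * logp_text_only
    return dpo + reg
```

Raising `logp_chosen` (or lowering `logp_text_only`) decreases the loss, so training is pushed toward answers that are preferred *and* grounded in the audiovisual inputs rather than in the language backbone's priors.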