AI Summary
This work addresses the susceptibility of multimodal large language models (MLLMs) to spurious audiovisual cues and hallucinations stemming from overreliance on textual priors in emotion understanding, which often leads to erroneous reasoning. To mitigate these issues, the authors propose AVEm-DPO, a preference optimization method that constructs preference pairs containing either spurious associations or hallucinated responses, augmented with a regularization term to suppress dependence on textual priors and enhance sensitivity to genuine audiovisual emotional signals. The study further introduces EmoReAlM, a novel benchmark designed to quantitatively evaluate cue-emotion alignment and hallucination in multimodal emotion tasks, an aspect previously unaddressed in the literature. Evaluated under zero-shot settings on DFEW, RAVDESS, and EMER datasets, the proposed approach achieves relative performance improvements of 6%–19% over baseline methods.
Abstract
Emotion understanding is essential for building socially intelligent agents. Although recent multimodal large language models have shown strong performance on this task, two key challenges remain: spurious associations between emotions and irrelevant audiovisual cues, and hallucinations of audiovisual cues driven by text priors in the language model backbone. To quantify and understand these issues, we introduce EmoReAlM, a benchmark designed to evaluate MLLMs for cue-emotion associations, hallucinations, and modality agreement. We then propose AVEm-DPO, a preference optimization technique that aligns model responses with both audiovisual inputs and emotion-centric queries. Specifically, we construct preferences over responses exhibiting spurious associations or hallucinations, and over audiovisual input pairs guided by textual prompts. We also include a regularization term that penalizes reliance on text priors, thereby mitigating modality-specific cue hallucinations. Experimental results on DFEW, RAVDESS, and EMER demonstrate that our method significantly improves on the reference baseline models, with relative performance gains of 6–19% in zero-shot settings. By providing both a rigorous benchmark and a robust optimization framework, this work enables principled evaluation and improvement of MLLMs for emotion understanding and social AI. Code, models, and the benchmark will be released at https://avere-iclr.github.io.
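To make the shape of such an objective concrete, here is a minimal sketch of a standard DPO loss extended with a text-prior penalty. This is an illustration only: the paper's exact formulation is not given in the abstract, so the regularizer (a penalty on how likely the preferred response remains when audiovisual inputs are withheld) and all function and parameter names (`dpo_loss_with_prior_reg`, `logp_text_only`, `beta`, `lam`) are assumptions, not the authors' definitions.

```python
import math

def dpo_loss_with_prior_reg(logp_chosen, logp_rejected,
                            ref_logp_chosen, ref_logp_rejected,
                            logp_text_only, beta=0.1, lam=0.1):
    """Illustrative DPO loss plus a hypothetical text-prior regularizer.

    All log-probabilities are scalars: policy and frozen-reference
    log-likelihoods of the chosen/rejected responses, and the policy
    log-likelihood of the chosen response given text only (audiovisual
    inputs masked). The regularizer term is an assumption, not the
    paper's exact formulation.
    """
    # Standard DPO margin: log-ratio of chosen vs. rejected responses,
    # each measured relative to the frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # DPO loss: -log sigmoid(margin); small when chosen >> rejected.
    dpo = -math.log(1.0 / (1.0 + math.exp(-margin)))
    # Hypothetical regularizer: penalize responses that stay likely
    # even without audiovisual evidence, i.e. text-prior-driven answers.
    reg = lam * logp_text_only
    return dpo + reg
```

Raising `logp_chosen` (or lowering `logp_text_only`) decreases the loss, so training is pushed toward answers that are preferred *and* grounded in the audiovisual inputs rather than in the language backbone's priors.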