Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

📅 2026-06-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing emotional video captioning methods rely on holistic visual features, which fail to localize the specific segments that evoke emotions and therefore introduce redundant information and inaccurate emotional cues. To address this limitation, this work proposes a fine-grained emotion-cause pair extraction framework that learns paired emotion and cause features in two rounds. In the first round, a Concept-aware Visual Semantic Decomposition module augments visual features with scene, object, and motion concepts, and a Visual-guided Emotion Interpretable Learning module refines emotion features using visual temporal dynamics under Valence-Arousal-Dominance (VAD) vector constraints. In the second round, the visual and emotional features from before and after refinement are cross-coupled and aligned with a contrastive loss to extract emotion-cause pairs. Evaluated on three datasets, the approach outperforms prior methods, with gains of +4.4% in BLEU-2 and +5.4% in ROUGE-L on EVC-MSVD.
📝 Abstract
Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues and then aggregate multimodal features to guide emotional caption generation, which ignores a critical characteristic of the EVC task: visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. Holistic mining therefore introduces significant information redundancy and inaccurate emotional cues, whereas fine-grained visual cause extraction facilitates both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. In addition, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics and strengthens the interpretable refinement process with reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage a contrastive loss to enforce semantic alignment. Overall, our approach improves complex semantic understanding and emotion perception of videos, leading to promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and the effectiveness of each proposed module, e.g., achieving the best performance with +4.4% and +5.4% w.r.t. BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.
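The abstract does not specify the form of the contrastive loss used to align visual cause features with emotion features. As a rough illustration only, the sketch below shows one common choice, a symmetric InfoNCE-style loss over a batch of paired feature vectors. The function names, the `temperature` value, and the use of cosine similarity are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax_row(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(visual_feats, emotion_feats, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss.

    visual_feats[i] and emotion_feats[i] are treated as a matched
    (cause, emotion) pair; every other combination in the batch is
    treated as a negative. The temperature is an assumed default.
    """
    v = [l2_normalize(x) for x in visual_feats]
    e = [l2_normalize(x) for x in emotion_feats]
    n = len(v)
    # sim[i][j]: temperature-scaled cosine similarity of visual i and emotion j
    sim = [
        [sum(a * b for a, b in zip(v[i], e[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-emotion direction: each row should peak on its diagonal entry.
    v2e = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    # Emotion-to-visual direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    e2v = -sum(log_softmax_row(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2e + e2v)


# Toy usage: matched pairs should score a lower loss than mismatched pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
matched_emotion = [[1.0, 0.0], [0.0, 1.0]]
mismatched_emotion = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = symmetric_contrastive_loss(visual, matched_emotion)
loss_mismatched = symmetric_contrastive_loss(visual, mismatched_emotion)
```

In this toy setup the loss is small when each visual vector points in the same direction as its paired emotion vector and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to encourage.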
Problem

Research questions and friction points this paper is trying to address.

Emotional Video Captioning
emotion-cause pair
fine-grained extraction
visual emotion
motivational causes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion-Cause Pair Extraction
Fine-grained Video Understanding
Visual-guided Emotion Learning
VAD-vector Constraints
Contrastive Semantic Alignment
Weidong Chen
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
Cheng Ye
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China
Zhendong Mao
University of Science and Technology of China
CV, NLP
Liping Wang
Institute of Neuroscience, Chinese Academy of Sciences
sequence learning, working memory, bodily self-consciousness, causal inference
Xinyan Liu
School of Information Science and Technology, Harbin Institute of Technology (Weihai), Weihai 264209, China
Yongdong Zhang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230027, China; and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230027, China