🤖 AI Summary
Existing methods for emotional video captioning (EVC) rely on holistic visual features, which often fail to localize the specific segments that elicit emotions, resulting in redundant information and inaccurate emotional cues. To address this limitation, this work proposes a two-round fine-grained emotion-cause pair extraction framework built on a Concept-aware Visual Semantic Decomposition module and a Visual-guided Emotion Interpretable Learning module. By incorporating Valence-Arousal-Dominance (VAD) vector constraints and a contrastive loss for cross-modal alignment, the model couples emotions with their visual causes. Experiments on three benchmarks show that the proposed approach outperforms prior methods; on EVC-MSVD it achieves the best performance, with gains of +4.4% in BLEU-2 and +5.4% in ROUGE-L.
📝 Abstract
Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues and then aggregate multimodal features to guide emotional caption generation. This pipeline ignores a critical characteristic of the EVC task: visual emotions are evoked by specific motivational causes, which are usually implied only in core video segments. Holistic mining therefore introduces significant information redundancy and inaccurate emotional cues, whereas fine-grained visual cause extraction can facilitate both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. In addition, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics and makes the refinement process more interpretable through reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and apply a contrastive loss to enforce semantic alignment. Overall, our approach improves complex semantic understanding and emotion perception of videos, leading to promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module, e.g., achieving the best performance on the EVC-MSVD dataset with gains of +4.4% and +5.4% in BLEU-2 and ROUGE-L, respectively.
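The abstract does not specify the form of the contrastive loss used to align visual cause features with emotion features. As a rough illustration only, the sketch below assumes a symmetric InfoNCE objective over a batch of paired vectors; the function names (`l2_normalize`, `info_nce`), the temperature value, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(visual, emotion, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed form, not the paper's definition).

    visual[i] and emotion[i] are treated as a matched cause/emotion pair;
    every other combination in the batch acts as a negative.
    """
    v = [l2_normalize(x) for x in visual]
    e = [l2_normalize(x) for x in emotion]
    n = len(v)

    # Temperature-scaled cosine similarities between every visual/emotion pair.
    logits = [
        [sum(a * b for a, b in zip(v[i], e[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-emotion and emotion-to-visual directions.
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed)
    )


# Toy batch: two orthogonal "visual cause" vectors.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(visual_feats, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(visual_feats, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, the loss is close to zero when each emotion vector matches its own visual vector and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to encourage.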