🤖 AI Summary
This study addresses key challenges in modeling how visual content elicits pleasure through cognitive appraisal—namely, label noise, semantic ambiguity, data scarcity, and model opacity. To bridge the semantic gap between general "positive emotion" and the specific affective state of "pleasure," this work proposes the first interpretable pleasure prediction model grounded in cognitive appraisal theory within a multimodal affective computing framework. The approach integrates Transformers with attention mechanisms to extract fine-grained multimodal features and incorporates fuzzy logic to explicitly model cognitive appraisal variables. Evaluated on a video-based pleasure induction task, the model achieves a peak accuracy of 0.6624, demonstrating significant improvements in both predictive performance and interpretability compared to existing methods.
📝 Abstract
Multimodal affective computing analyzes user-generated social media content to predict emotional states. However, a critical gap remains in understanding how visual content shapes cognitive interpretations and elicits specific affective experiences such as pleasure. This study introduces a computational model that infers video-induced pleasure via cognitive appraisal variables. The model addresses four challenges: (1) noisy and inconsistent human labels, (2) the semantic gap between "positive emotions" and "pleasure," (3) the scarcity of pleasure-specific datasets, and (4) the limited interpretability of existing black-box fusion methods. Our approach combines data-driven and cognitive theory-driven methods, pairing cognitive appraisal theory with a fuzzy model. It employs transformer-based architectures and attention mechanisms for fine-grained multimodal feature extraction and interpretable fusion, capturing both inter- and intra-modal dynamics associated with pleasure. This design enables the prediction of underlying appraisal variables, thereby bridging the semantic gap and improving explainability beyond conventional statistical associations. Experiments on detecting video-induced pleasure support the method, which reaches a peak accuracy of 0.6624 in predicting pleasure levels. These findings have promising implications for affective content recommendation, intelligent media creation, and our understanding of how digital media influences human emotions.
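To make the fuzzy, appraisal-based step more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes two hypothetical appraisal variables (`goal_congruence` and `novelty`, each scaled to [0, 1]), triangular membership functions, and a small hand-written rule base with zero-order Sugeno-style weighted averaging. The variable names, rules, and output values are all assumptions for illustration; the abstract does not specify them. In the described model, the appraisal values would come from the transformer-based multimodal features rather than being supplied by hand.

```python
from typing import Dict, Tuple


def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with peak at b; acts as a shoulder if a == b or b == c."""
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(x: float) -> Dict[str, float]:
    """Map a value in [0, 1] to degrees of 'low', 'medium', and 'high'."""
    return {
        "low": tri(x, 0.0, 0.0, 0.5),
        "medium": tri(x, 0.0, 0.5, 1.0),
        "high": tri(x, 0.5, 1.0, 1.0),
    }


def predict_pleasure(goal_congruence: float, novelty: float) -> Tuple[float, Dict[str, float]]:
    """Return a pleasure score in [0, 1] and the firing strength of each rule.

    The rule base below is hypothetical and chosen only for illustration.
    """
    c = fuzzify(goal_congruence)
    n = fuzzify(novelty)
    # (rule name, firing strength, crisp pleasure value assigned to the rule)
    rules = [
        ("congruence high AND novelty high", min(c["high"], n["high"]), 1.0),
        ("congruence high AND novelty medium", min(c["high"], n["medium"]), 0.85),
        ("congruence high AND novelty low", min(c["high"], n["low"]), 0.7),
        ("congruence medium", c["medium"], 0.5),
        ("congruence low", c["low"], 0.0),
    ]
    total = sum(strength for _, strength, _ in rules)
    if total == 0.0:
        # No rule fired (e.g. inputs outside [0, 1]); fall back to a neutral score.
        return 0.5, {name: 0.0 for name, _, _ in rules}
    score = sum(strength * value for _, strength, value in rules) / total
    return score, {name: strength for name, strength, _ in rules}


if __name__ == "__main__":
    score, fired = predict_pleasure(goal_congruence=0.75, novelty=0.75)
    print(f"pleasure score: {score:.3f}")
    for name, strength in fired.items():
        print(f"  {name}: {strength:.2f}")
```

The per-rule firing strengths are what give such a model its interpretability: for any prediction, one can read off which appraisal conditions contributed and by how much, rather than relying on an opaque fused representation.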