EmoGist: Efficient In-Context Learning for Visual Emotion Understanding

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of supervised visual emotion classification, namely its reliance on labeled data and fixed label spaces, this paper proposes a training-free in-context learning framework that leverages Large Vision-Language Models (LVLMs) for zero-shot, interpretable emotion recognition. Methodologically, it first performs unsupervised clustering over an image collection to automatically generate contextualized explanations of each emotion label; at test time, it retrieves the most semantically relevant cluster explanation for each image via embedding similarity and uses it to guide a lightweight LVLM's classification. The core contribution is the first contextualized, dynamic definition mechanism for emotion labels, which removes the need for a predefined label space. Experiments demonstrate substantial improvements: +13.0 micro-F1 on the multi-label Memotion dataset and +8.2 macro-F1 on the multi-class FI dataset, significantly outperforming existing zero-shot approaches.

📝 Abstract
In this paper, we introduce EmoGist, a training-free, in-context learning method for performing visual emotion classification with LVLMs. The key intuition of our approach is that context-dependent definitions of emotion labels can allow more accurate predictions of emotions, as the ways in which emotions manifest within images are highly context-dependent and nuanced. EmoGist pre-generates multiple explanations of each emotion label by analyzing clusters of example images belonging to that category. At test time, we retrieve a version of the explanation based on embedding similarity and feed it to a fast VLM for classification. Through our experiments, we show that EmoGist yields up to 13 points of improvement in micro F1 on the multi-label Memotion dataset, and up to 8 points in macro F1 on the multi-class FI dataset.
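The offline stage described above (clustering example images per label, then generating one explanation per cluster) can be sketched as follows. This is an illustrative sketch, not the paper's exact recipe: the embedding dimensionality, cluster count, prompt wording, and the toy `kmeans` helper are all assumptions; in practice the cluster-description prompts would be answered by a strong LVLM.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Minimal k-means over image embeddings (stand-in for any clusterer)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        assign = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # Recompute centroids; keep the old one if a cluster empties out.
        centroids = np.stack([
            X[assign == j].mean(0) if (assign == j).any() else centroids[j]
            for j in range(k)
        ])
    return centroids, assign

# Toy embeddings of example images for one label, forming two visual contexts.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
centroids, assign = kmeans(X, k=2)

# One explanation prompt per cluster, answered offline by a strong LVLM.
prompts = [
    f"Describe how 'amusement' manifests in the images of cluster {j}."
    for j in range(len(centroids))
]
```

The resulting per-cluster explanations and centroids are stored and reused at test time, which is what makes the method training-free.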
Problem

Research questions and friction points this paper is trying to address.

Reducing the reliance on labeled data and predefined label spaces in visual emotion classification
Capturing the context-dependent, nuanced ways emotions manifest within images
Closing the zero-shot performance gap on multi-label and multi-class emotion datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free in-context learning for emotion classification
Pre-generates emotion label explanations via image clusters
Retrieves explanations by embedding similarity for VLM classification
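The test-time retrieval step in the last bullet can be sketched as a nearest-centroid lookup over the pre-generated explanations. This is a minimal sketch assuming cosine similarity over embeddings; the `retrieve_explanation` helper and the toy `CLUSTERS` data are illustrative, not names from the paper.

```python
import numpy as np

def retrieve_explanation(image_emb, clusters):
    """Return the (label, explanation) whose cluster centroid has the
    highest cosine similarity to the test image embedding."""
    best_label, best_text, best_sim = None, None, -np.inf
    for label, explanation, centroid in clusters:
        sim = image_emb @ centroid / (
            np.linalg.norm(image_emb) * np.linalg.norm(centroid)
        )
        if sim > best_sim:
            best_label, best_text, best_sim = label, explanation, sim
    return best_label, best_text

# Toy store of (emotion label, pre-generated explanation, cluster centroid).
CLUSTERS = [
    ("joy",   "Joy in this context appears as bright, celebratory scenes.",
     np.array([1.0, 0.1])),
    ("anger", "Anger in this context is conveyed by tense, dark imagery.",
     np.array([-1.0, 0.2])),
]

label, explanation = retrieve_explanation(np.array([0.9, 0.0]), CLUSTERS)
# The retrieved explanation is then placed in the fast VLM's prompt
# to guide its classification of the test image.
```

Because only the most relevant explanation is retrieved per image, the prompt stays short enough for a lightweight VLM to classify efficiently.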