🤖 AI Summary
This work addresses the fine-grained modeling challenge in multimodal humor understanding, specifically caption generation for multi-image memes. The authors propose a vision-language co-grounded approach that operates at the sub-image level, built around a novel multi-granularity reward model that jointly captures global and local image-text similarity and integrates supervised fine-tuning, reinforcement learning, cross-modal alignment, and sub-image attention. This design strengthens the model's capacity to jointly represent sub-image semantics and humorous logic in multi-image meme contexts. Experiments show state-of-the-art performance: average scores of 75.85 and 66.32 on single-image and multi-image meme captioning benchmarks, respectively, outperforming the strongest baseline by 3.71% and 4.82%. The approach also exhibits markedly improved cross-category generalization.
📝 Abstract
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While natural language processing has made advances, real-world humor often thrives in a multi-modal context, encapsulated distinctively by memes. This paper places particular emphasis on the impact of multiple images on meme captioning. To address this, we introduce the XMeCap framework, a novel approach that combines supervised fine-tuning with reinforcement learning based on an innovative reward model that factors in both global and local similarities between visuals and text. Benchmarked against contemporary models, our results show a marked improvement in caption generation for both single-image and multi-image memes, as well as across different meme categories. XMeCap achieves an average evaluation score of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming the best baseline by 3.71% and 4.82%, respectively. This research not only establishes a new frontier in meme-related studies but also underscores the potential of machines to understand and generate humor in a multi-modal setting.
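The abstract does not specify the exact form of the reward model, only that it combines global and local (sub-image) image-text similarity. A minimal sketch of one plausible multi-granularity reward, assuming precomputed caption and image embeddings and a hypothetical weighting hyperparameter `alpha` (not given in the paper), might look like:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def multi_granularity_reward(caption_emb, global_img_emb, sub_img_embs, alpha=0.5):
    """Illustrative reward: weighted sum of the global image-caption
    similarity and the mean of per-sub-image similarities.

    `alpha` is a hypothetical global/local trade-off weight; the actual
    XMeCap formulation may differ.
    """
    global_sim = cosine(caption_emb, global_img_emb)
    local_sim = float(np.mean([cosine(caption_emb, e) for e in sub_img_embs]))
    return alpha * global_sim + (1 - alpha) * local_sim

# Toy example with 2-D embeddings: caption aligns perfectly with the
# global image and with one of two sub-images.
reward = multi_granularity_reward(
    np.array([1.0, 0.0]),                       # caption embedding
    np.array([1.0, 0.0]),                       # global image embedding
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],  # sub-image embeddings
    alpha=0.5,
)
print(reward)  # 0.75: global_sim = 1.0, mean local_sim = 0.5
```

In an RL fine-tuning loop, a scalar reward of this shape would score each generated caption against both the whole meme and its sub-images before the policy update.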