Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing large vision-language models often interpret internet memes only at a literal level and struggle to capture the author's pragmatic intent. The paper reframes meme understanding as a disentanglement problem between literal content and pragmatic intent and introduces the Intent Projection framework, presented as the first approach to orthogonally decompose these two representations within a single model. The framework combines an orthogonal projection module, structured reasoning chains, a surface-real affect classifier, and a contrastive reward to separate literal and pragmatic signals at the representation, output, and training-objective levels, so that literal information does not contaminate pragmatic interpretation. Across six benchmarks, the method consistently outperforms open-source baselines and narrows the gap to closed-source models, with the largest gains on high-divergence posts where the literal and intended meanings differ most.
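The contrastive reward mentioned in the summary can be illustrated with a toy scoring function. This is only a sketch under loud assumptions: the paper's actual reward is not specified here, and the token-overlap similarity `jaccard`, the function `contrastive_reward`, and the `penalty` weight are all hypothetical stand-ins chosen to show the idea of rewarding closeness to the intended meaning while penalizing restatement of the literal description.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1] (hypothetical stand-in for a learned scorer)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def contrastive_reward(answer: str, intent_ref: str, literal_desc: str,
                       penalty: float = 1.0) -> float:
    """Reward similarity to the pragmatic reference, penalize similarity to the literal caption.

    ASSUMPTION: this additive form and the `penalty` weight are illustrative only.
    """
    return jaccard(answer, intent_ref) - penalty * jaccard(answer, literal_desc)


literal = "a dog sits in a burning room"
intent = "the author mocks denial during a crisis"

# An answer that merely restates the image scores below one that names the intent.
literal_answer_reward = contrastive_reward(literal, intent, literal)
intent_answer_reward = contrastive_reward(intent, intent, literal)
print(literal_answer_reward < 0 < intent_answer_reward)
```

In this toy setting, an answer that copies the literal caption receives a negative reward, which mirrors the stated goal of discouraging literal collapse.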

📝 Abstract

When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details contaminate the final response. We reframe meme understanding as a problem of literal-pragmatic decomposition and propose Intent Projection, a framework that separates the two signals at the representation, output, and objective levels within a single LVLM backbone. At the representation level, an orthogonal projection module removes dominant unimodal directions from the fused image-text representation, retaining only the pragmatic residual, while a surface-real affect classifier anchors the decoder with a discrete tag that names the polarity gap. At the output level, the model externalizes a structured reasoning chain, and at the objective level a contrastive reward explicitly penalizes answers that restate the literal description. Across six multimodal benchmarks, Intent Projection consistently outperforms open-source baselines and narrows the gap to proprietary models, with the largest gains on high-divergence posts where literal collapse is most damaging.
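The representation-level step described in the abstract, removing dominant unimodal directions from a fused representation and keeping the residual, can be sketched with standard linear algebra. The following is a minimal NumPy illustration, not the authors' module: the function name `project_out`, the array shapes, and the assumption that the unimodal directions are given as linearly independent row vectors are all hypothetical.

```python
import numpy as np


def project_out(fused: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the subspace spanned by `directions` from each row of `fused`.

    fused:      (n, d) fused image-text vectors.
    directions: (k, d) directions to remove (ASSUMED linearly independent, k <= d).
    Returns the (n, d) residual orthogonal to every row of `directions`.
    """
    # Orthonormal basis of the span of the directions; q has shape (d, k).
    q, _ = np.linalg.qr(directions.T)
    # Subtract each vector's component lying inside that subspace.
    return fused - (fused @ q) @ q.T


# Toy example: remove the first two axes, leaving only the third component.
fused = np.array([[3.0, 4.0, 5.0]])
unimodal_dirs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
residual = project_out(fused, unimodal_dirs)
print(np.allclose(residual @ unimodal_dirs.T, 0.0))  # residual is orthogonal to the removed directions
```

How the dominant directions are estimated (for example from image-only and text-only encodings) is not stated in the abstract, so the sketch takes them as given inputs.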

Problem

Research questions and friction points this paper is trying to address.

pragmatic intent

multimodal meme understanding

literal-pragmatic decomposition

large vision language models

sarcasm interpretation

Innovation

Methods, ideas, or system contributions that make the work stand out.

pragmatic intent

literal-pragmatic decomposition

orthogonal projection