Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes

📅 2025-07-17

🤖 AI Summary

This study investigates whether large language models' (LLMs) ability to explain humor depends on the form of the joke, comparing simple puns with more complex topical humor that requires knowledge of real-world entities and events rather than "common sense" alone. To support this comparison, we curate a dataset of 600 jokes split across four joke types (heterographic puns, homographic puns, contemporary internet humor, and topical jokes rooted in news events and pop culture), each paired with a manually written, high-quality explanation. Under zero-shot evaluation of a range of LLMs, including reasoning models, none of the tested models reliably generates adequate explanations for all joke types, which highlights the narrow focus of most computational humor work on overly simple joke forms.

📝 Abstract

Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond "common sense", rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (including reasoning models) is capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.

Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to explain diverse humor types

Comparing humor explanation performance on puns vs topical jokes

Identifying gaps in computational humor research coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated a dataset of 600 jokes across 4 joke types, each with a manually written explanation

Compared LLMs' zero-shot humor explanation abilities

Highlighted gaps in computational humor research
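
As a rough illustration of the setup described above, the sketch below shows one way a dataset entry and a zero-shot explanation prompt could be represented. It is not the authors' code or data format: the `JokeRecord` fields, the label strings in `JOKE_TYPES`, the prompt wording, and the example joke are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# The four joke types as read from the abstract; the label strings are hypothetical.
JOKE_TYPES = ("heterographic_pun", "homographic_pun", "internet_humour", "topical")


@dataclass(frozen=True)
class JokeRecord:
    """One dataset entry: a joke, its type, and a human-written explanation."""
    joke_type: str
    text: str
    reference_explanation: str

    def __post_init__(self):
        # Reject labels outside the assumed four-way taxonomy.
        if self.joke_type not in JOKE_TYPES:
            raise ValueError(f"unknown joke type: {self.joke_type!r}")


def build_zero_shot_prompt(record: JokeRecord) -> str:
    """Build a prompt with no worked examples and no reference explanation."""
    return (
        "Explain why the following joke is funny.\n\n"
        f"Joke: {record.text}\n\n"
        "Explanation:"
    )


def count_by_type(records: list[JokeRecord]) -> dict[str, int]:
    """Count records per joke type, reporting zero for types with no entries."""
    counts = Counter(r.joke_type for r in records)
    return {t: counts.get(t, 0) for t in JOKE_TYPES}


# Illustrative entry only; this joke is not taken from the paper's dataset.
example = JokeRecord(
    joke_type="homographic_pun",
    text="I used to be a banker, but I lost interest.",
    reference_explanation=(
        "'Interest' means both curiosity and the money earned on deposits."
    ),
)
prompt = build_zero_shot_prompt(example)
```

Keeping the reference explanation out of the prompt is what makes the setting zero-shot in this sketch: the model sees only the joke, and the human-written explanation is reserved for judging the output afterwards.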
