Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation

📅 2025-11-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM-based humor evaluation relies heavily on a single “funniness” metric and fails to capture the multidimensional nature of humor perception. Method: We propose a six-dimensional human-annotated framework covering Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness, and systematically evaluate LLMs’ generation and self-assessment capabilities on the Japanese Oogiri improvisational comedy task, using an expanded Oogiri dataset with 5-point Likert-scale annotations. Contribution/Results: Experiments show that current LLMs generate humor at a level between low- and mid-tier human performers. They also exhibit a strong bias toward Novelty, whereas humans prioritize Empathy, a fundamental misalignment that prevents them from faithfully replicating human humor judgments; Empathy emerges as the critical bottleneck in LLMs’ humor understanding. To support future research, we publicly release our multidimensional annotation corpus, establishing a new benchmark and resource for affectively intelligent dialogue systems.

📝 Abstract
Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply “funny.” This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a Japanese improvisational comedy game. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses along all six dimensions. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.
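To make the annotation format concrete, here is a minimal sketch of what one record in such a corpus could look like, assuming one 5-point Likert score per dimension. The class name, field names, and validation logic are hypothetical illustrations, not the authors' released schema.

```python
from dataclasses import dataclass

# Hypothetical record structure for one annotated Oogiri response.
# The dimension names follow the abstract; everything else is an assumption.
@dataclass
class OogiriAnnotation:
    prompt: str             # the Oogiri prompt given to the responder
    response: str           # a human- or LLM-written punchline
    source: str             # e.g. "human" or a model name
    novelty: int            # 1-5 Likert rating
    clarity: int            # 1-5
    relevance: int          # 1-5
    intelligence: int       # 1-5
    empathy: int            # 1-5
    overall_funniness: int  # 1-5

    def __post_init__(self) -> None:
        # Guard against out-of-range Likert scores at construction time.
        for name in ("novelty", "clarity", "relevance",
                     "intelligence", "empathy", "overall_funniness"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")
```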
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' humor capabilities using multidimensional Oogiri comedy analysis
Assessing LLM performance in generating versus evaluating humorous content
Identifying the Empathy gap in LLMs' humor generation compared to humans
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expanded Oogiri dataset with LLM-generated responses
Multi-dimensional manual annotation using six humor criteria
Correlation analysis revealing human-LLM evaluation divergence (illustrated in the sketch below)
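A minimal sketch of the kind of correlation analysis referenced above: correlate each dimension's ratings with Overall Funniness separately for human and LLM raters, then compare which dimension tracks the funniness judgment most strongly. The data below is randomly generated just to make the sketch runnable, and the procedure is an assumption, not the paper's actual code.

```python
import numpy as np
from scipy.stats import spearmanr

# Toy ratings: rows are Oogiri responses, columns are the five sub-dimensions,
# all on a 1-5 Likert scale. Values are fabricated purely for illustration.
rng = np.random.default_rng(0)
dimensions = ["Novelty", "Clarity", "Relevance", "Intelligence", "Empathy"]

def dimension_correlations(ratings: np.ndarray, overall: np.ndarray) -> dict:
    """Spearman correlation of each dimension's ratings with Overall Funniness."""
    return {
        dim: spearmanr(ratings[:, i], overall).correlation
        for i, dim in enumerate(dimensions)
    }

human_ratings = rng.integers(1, 6, size=(100, 5))
human_overall = rng.integers(1, 6, size=100)
llm_ratings = rng.integers(1, 6, size=(100, 5))
llm_overall = rng.integers(1, 6, size=100)

# Per the paper's reported finding, one would expect Empathy to dominate for
# humans and Novelty for LLMs; random data will of course show ~0 throughout.
print("human:", dimension_correlations(human_ratings, human_overall))
print("LLM:  ", dimension_correlations(llm_ratings, llm_overall))
```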