🤖 AI Summary
This work addresses the challenges of evaluating humor generation by large language models (LLMs), notably subjectivity, mechanistic complexity, and the absence of a universal benchmark, by proposing a tournament-based evaluation framework. The approach integrates the General Theory of Verbal Humor (GTVH) into LLM assessment: theory-guided pairwise preference judgments are aggregated via the Bradley-Terry model to produce interpretable, stable global rankings that are comparable across models. Experiments on the SemEval-2026 MWAHAHA and Humor Transfer Bench datasets report high inter-rater consistency, with a Kendall’s τ of 0.889, and reveal a strong correlation between a model’s humor proficiency and its grasp of underlying comedic mechanisms.
📝 Abstract
Evaluating humor in large language models (LLMs) is an open challenge because existing approaches yield isolated, incomparable metrics rather than unified model rankings, making it difficult to track progress across systems. We introduce HumorRank, a tournament-based evaluation framework and leaderboard for textual humor generation. Using the SemEval-2026 MWAHAHA test dataset, we conduct an extensive automated pairwise evaluation across nine models spanning proprietary, open-weight, and specialized systems. Pairwise judgments grounded in the General Theory of Verbal Humor (GTVH) are collected through an Adaptive Swiss tournament and aggregated with Bradley-Terry Maximum Likelihood Estimation (MLE), producing globally consistent rankings of humor generation capability. Our results demonstrate that HumorRank yields statistically grounded model stratifications, showing that humor quality is driven by mastery of comedic mechanisms rather than model scale alone. HumorRank thus provides a scalable, interpretable methodology for benchmarking and understanding LLM-generated humor.
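To make the aggregation step concrete, the following is a minimal sketch of how pairwise win counts can be turned into Bradley-Terry strengths. Under that model, the probability that system i beats system j is p_i / (p_i + p_j). The abstract does not specify the fitting procedure, so this sketch assumes the standard minorization-maximization (MM) update for the MLE; the function name `bradley_terry`, the `wins` dictionary layout, and the model labels "A", "B", "C" with their counts are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def bradley_terry(wins, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is the number of comparisons in which a was preferred to b
    (hypothetical input layout). Returns a dict of strengths whose mean is 1.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = defaultdict(float)   # wins per model
    games = defaultdict(float)        # comparisons per unordered pair
    for (a, b), count in wins.items():
        total_wins[a] += count
        games[frozenset((a, b))] += count

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = games.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only identified up to scale, so renormalize each pass.
        scale = len(models) / sum(updated.values())
        updated = {m: v * scale for m, v in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


# Illustrative counts only: each pair of models is compared ten times.
example_wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
scores = bradley_terry(example_wins)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

A model that never wins is driven toward zero strength by this update, so in practice a small prior or regularization term is often added. Whether HumorRank does so is not stated here.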