🤖 AI Summary
This study addresses the challenge of detecting deliberately crafted “bad humor” in English—a genre on which state-of-the-art humor detection models exhibit significant performance degradation. To tackle this, we introduce the first bad-humor corpus, derived from the Bulwer-Lytton Fiction Contest, and systematically analyze its structural patterns involving puns, irony, metaphor, and metafictional devices. We conduct the first human–LLM comparative study on bad-humor generation, revealing that LLMs over-rely on specific rhetorical devices and nonce collocations, exposing a rhetorical control bias. Integrating literary rhetorical analysis, controllable prompt engineering, and human–AI collaborative evaluation, we demonstrate that current models lack robust semantic–stylistic disentanglement capabilities for low-quality humor. Our contributions include: (1) a novel, manually annotated bad-humor benchmark; (2) empirical evidence of LLMs’ rhetorical limitations; and (3) open-sourced data and code to advance computational humor research.
📝 Abstract
Textual humor is enormously diverse, and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect, over-using certain literary devices and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton