Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content

📅 2025-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study investigates how humor, particularly irony, sarcasm, and absurdity, enhances the deceptive efficacy of misinformation and facilitates its cross-lingual propagation. To address this, we introduce DHD, the first multilingual synthetic benchmark dataset specifically designed for deceptive humor detection, covering five languages (English, Telugu, Hindi, Kannada, and Tamil) and four code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En). Methodologically, we propose a novel three-level satire scale and a five-category humor taxonomy, and pioneer a “large language model generation + multi-stage human verification” synthesis paradigm, integrated with a multilingual NLP pipeline for text cleaning, code-mixing identification, and annotation alignment. DHD comprises over 10,000 high-quality samples. Fine-tuned RoBERTa and XLM-R baselines achieve 78.3%–86.1% accuracy on satire-level and humor-category classification, significantly outperforming zero-shot cross-lingual transfer and thereby establishing a robust foundation for multilingual deceptive humor analysis.

📝 Abstract
This paper presents the Deceptive Humor Dataset (DHD), a novel resource for studying humor derived from fabricated claims and misinformation. In an era of rampant misinformation, understanding how humor intertwines with deception is essential. DHD consists of humor-infused comments generated with the ChatGPT-4o model from false narratives that incorporate fabricated claims and manipulated information. Each instance is labeled with a Satire Level, ranging from 1 (subtle satire) to 3 (high-level satire), and is classified into one of five Humor Categories: Dark Humor, Irony, Social Commentary, Wordplay, and Absurdity. The dataset spans multiple languages, including English, Telugu, Hindi, Kannada, Tamil, and their code-mixed variants (Te-En, Hi-En, Ka-En, Ta-En), making it a valuable multilingual benchmark. By introducing DHD, we establish a structured foundation for analyzing humor in deceptive contexts, paving the way for a new research direction that explores how humor not only interacts with misinformation but also influences its perception and spread. We also establish strong baselines on the dataset, against which future deceptive humor detection models can be benchmarked.
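The labeling scheme described in the abstract can be sketched as a small record type. This is only an illustrative sketch, not the dataset's actual file format: the class name `DHDRecord`, the field names, and the monolingual language tags (`En`, `Te`, `Hi`, `Ka`, `Ta`) are assumptions made for the example. Only the Satire Level range, the five Humor Categories, and the code-mixed tags come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class HumorCategory(Enum):
    """The five Humor Categories named in the abstract."""
    DARK_HUMOR = "Dark Humor"
    IRONY = "Irony"
    SOCIAL_COMMENTARY = "Social Commentary"
    WORDPLAY = "Wordplay"
    ABSURDITY = "Absurdity"


# Code-mixed tags are taken from the abstract; the monolingual tags
# ("En", "Te", "Hi", "Ka", "Ta") are assumed abbreviations.
LANGUAGES = ("En", "Te", "Hi", "Ka", "Ta", "Te-En", "Hi-En", "Ka-En", "Ta-En")


@dataclass(frozen=True)
class DHDRecord:
    """Hypothetical in-memory representation of one DHD instance."""
    comment: str                   # humor-infused comment text
    satire_level: int              # 1 = subtle satire ... 3 = high-level satire
    humor_category: HumorCategory
    language: str                  # one of LANGUAGES

    def __post_init__(self):
        # Reject labels outside the ranges the abstract describes.
        if self.satire_level not in (1, 2, 3):
            raise ValueError(f"satire_level must be 1, 2, or 3, got {self.satire_level}")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language tag: {self.language}")


# Placeholder text only; this is not a real sample from the dataset.
example = DHDRecord(
    comment="<humor-infused comment text>",
    satire_level=2,
    humor_category=HumorCategory.IRONY,
    language="Te-En",
)
```

Validating the labels at construction time keeps malformed rows out of any downstream baseline; a real loader would need to map the dataset's actual column names onto these assumed fields.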
Problem

Research questions and friction points this paper is trying to address.

Study humor derived from fabricated claims and misinformation.
Analyze the role humor plays in deceptive contexts across multiple languages and code-mixed variants.
Establish baselines for benchmarking deceptive humor detection models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Humor-infused comments are generated from fabricated claims with ChatGPT-4o.
Each instance carries a Satire Level (1 to 3) and one of five Humor Categories.
The multilingual dataset spans English, Telugu, Hindi, Kannada, Tamil, and four code-mixed variants.