🤖 AI Summary
Large language models (LLMs) exhibit a previously overlooked deficiency in understanding humor in professional settings, a critical bottleneck for value alignment. Method: We introduce the first industry-oriented professional humor dataset, comprising humorous utterances annotated with multidimensional appropriateness labels, and propose a context-sensitive appropriateness evaluation framework that combines human annotation with zero-shot and few-shot automated assessment to systematically benchmark five state-of-the-art LLMs. Results: All models underperform human annotators significantly in judging humor appropriateness (average accuracy deficit of 28.6%), revealing fundamental gaps in modeling implicit workplace context, particularly power dynamics, role boundaries, and organizational norms. This work pioneers the integration of humor understanding into professional-domain LLM evaluation, establishing both a new dimension for value alignment assessment and a foundational benchmark resource for future research.
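The evaluation setup described above can be sketched as follows. This is a minimal, hypothetical illustration only: the prompt template, labels, and toy data below are invented for exposition and do not reproduce the paper's actual dataset, prompts, or models. It shows the general shape of a few-shot appropriateness judgment and how an "accuracy deficit" between humans and a model could be computed.

```python
# Hypothetical few-shot prompt for judging workplace-humor appropriateness.
# The example joke, context fields, and label set are illustrative assumptions.
FEW_SHOT_PROMPT = """\
Decide whether the workplace joke is APPROPRIATE or INAPPROPRIATE.

Joke: "I'd tell you a UDP joke, but you might not get it."
Context: casual team stand-up.
Label: APPROPRIATE

Joke: {joke}
Context: {context}
Label:"""

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold (human-consensus) labels."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Toy data (invented): if human annotators match the gold labels more often
# than a model does, the difference is the model's accuracy deficit.
human_preds = ["APPROPRIATE", "INAPPROPRIATE", "APPROPRIATE", "INAPPROPRIATE"]
model_preds = ["APPROPRIATE", "APPROPRIATE",   "APPROPRIATE", "APPROPRIATE"]
gold        = ["APPROPRIATE", "INAPPROPRIATE", "APPROPRIATE", "INAPPROPRIATE"]

deficit = accuracy(human_preds, gold) - accuracy(model_preds, gold)
```

In this toy run the humans score 1.0 and the model 0.5, giving a deficit of 0.5; the paper reports an average deficit of 28.6% across five LLMs on its real dataset.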
📝 Abstract
With recent advances in Artificial Intelligence (AI) and Large Language Models (LLMs), the automation of everyday tasks such as automated writing is attracting growing attention. Accordingly, considerable effort has gone into aligning LLMs with human values, yet humor, particularly the professional humor used in industrial workplaces, has been largely neglected. To address this gap, we develop a dataset of professional humor statements annotated with the features that determine each statement's appropriateness. Our evaluation of five LLMs shows that they often fail to judge the appropriateness of humor accurately.