🤖 AI Summary
Large language models (LLMs) show significant deficiencies in understanding non-Western cultural norms, particularly Persian *taarof*, a system of ritual politeness grounded in humility, indirectness, and deference, which limits their global applicability. To address this, we introduce TaarofBench, the first benchmark explicitly designed to evaluate LLM understanding of *taarof*; it brings taarof into AI cultural competence assessment and exposes the limitations of Western politeness frameworks in this setting. The benchmark is built from native-speaker-validated role-play scenarios, and we apply supervised fine-tuning and Direct Preference Optimization to improve cultural alignment. Across five frontier LLMs, accuracy is 40–48% below that of native Persian speakers in scenarios where taarof is culturally appropriate; supervised fine-tuning and Direct Preference Optimization improve alignment with cultural expectations by 21.8% and 42.3%, respectively.
📝 Abstract
Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions: a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios that cover 12 common social interaction topics and are validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below those of native speakers when taarof is culturally appropriate. Performance varies across interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated "polite" by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve improvements of 21.8% and 42.3%, respectively, in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) establishes baselines across varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.
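The comparison of model accuracy against a human baseline, broken down by interaction topic, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `ScenarioResult` record, its field names, the topic labels, and the toy values are all hypothetical, and the gap is computed here as a simple difference in accuracy, which is an assumption about how the reported gap is defined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """One scored role-play scenario (hypothetical schema, not from the paper)."""
    topic: str            # interaction topic label, e.g. "gift" (illustrative)
    model_correct: bool   # whether the model response matched the expected behavior


def accuracy_by_topic(results):
    """Return the fraction of correct responses for each topic."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r.topic] += 1
        correct[r.topic] += int(r.model_correct)
    return {topic: correct[topic] / totals[topic] for topic in totals}


def gap_to_baseline(model_acc, human_acc):
    """Human accuracy minus model accuracy, for topics present in both."""
    return {t: human_acc[t] - model_acc[t] for t in model_acc if t in human_acc}


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    ScenarioResult("gift", True),
    ScenarioResult("gift", False),
    ScenarioResult("payment", True),
]
toy_human_acc = {"gift": 1.0, "payment": 1.0}

model_acc = accuracy_by_topic(toy_results)
gaps = gap_to_baseline(model_acc, toy_human_acc)
```

Keeping per-topic accuracies separate, rather than pooling all scenarios, is what makes topic-level variation of the kind described in the abstract visible.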