We Politely Insist: Your LLM Must Learn the Persian Art of Taarof

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant deficiencies in understanding non-Western cultural norms—particularly Persian *taarof*, a politeness system grounded in humility, indirectness, and deference—undermining their global applicability. To address this, we introduce TaarofBench, the first benchmark explicitly designed to evaluate cultural understanding through *taarof*, systematically integrating it into AI cultural competence assessment and exposing the explanatory limitations of Western politeness theories in non-Western contexts. Our methodology employs native-speaker-validated role-playing scenarios, supervised fine-tuning, and direct preference optimization to enhance cultural alignment. Experiments across five state-of-the-art LLMs reveal accuracy deficits of 40-48% relative to Persian native speakers on TaarofBench; supervised fine-tuning and direct preference optimization improve cultural alignment by 21.8% and 42.3%, respectively, demonstrating the efficacy and cross-cultural transferability of culture-aware modeling.

📝 Abstract
Large language models (LLMs) struggle to navigate culturally specific communication norms, limiting their effectiveness in global contexts. We focus on Persian taarof, a social norm in Iranian interactions, which is a sophisticated system of ritual politeness that emphasizes deference, modesty, and indirectness, yet remains absent from existing cultural benchmarks. We introduce TaarofBench, the first benchmark for evaluating LLM understanding of taarof, comprising 450 role-play scenarios covering 12 common social interaction topics, validated by native speakers. Our evaluation of five frontier LLMs reveals substantial gaps in cultural competence, with accuracy rates 40-48% below native speakers when taarof is culturally appropriate. Performance varies across interaction topics, improves with Persian-language prompts, and exhibits gender-based asymmetries. We also show that responses rated "polite" by standard metrics often violate taarof norms, indicating the limitations of Western politeness frameworks. Through supervised fine-tuning and Direct Preference Optimization, we achieve improvements of 21.8% and 42.3% in model alignment with cultural expectations. Our human study with 33 participants (11 native Persian, 11 heritage, and 11 non-Iranian speakers) establishes baselines across varying degrees of familiarity with Persian norms. This work lays the foundation for developing diverse and culturally aware LLMs, enabling applications that better navigate complex social interactions.
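The benchmark described above scores whether a model's response conforms to taarof expectations in each role-play scenario, with accuracy aggregated overall and per topic. The sketch below illustrates that kind of tally; the scenario fields, topic names, and judge labels are illustrative assumptions, not the paper's actual schema.

```python
from collections import defaultdict

# Hypothetical TaarofBench-style records: each scenario notes whether
# taarof is culturally expected and whether a judge found the model's
# response culturally appropriate. Values here are made up.
scenarios = [
    {"topic": "gift-giving", "taarof_expected": True,  "response_correct": True},
    {"topic": "payment",     "taarof_expected": True,  "response_correct": False},
    {"topic": "invitations", "taarof_expected": False, "response_correct": True},
]

def accuracy(items):
    """Fraction of scenarios where the judged response was appropriate."""
    return sum(s["response_correct"] for s in items) / len(items)

def accuracy_by_topic(items):
    """Per-topic accuracy, since performance can vary across topics."""
    buckets = defaultdict(list)
    for s in items:
        buckets[s["topic"]].append(s)
    return {topic: accuracy(group) for topic, group in buckets.items()}
```

A per-topic breakdown like `accuracy_by_topic` is what surfaces the variation across interaction topics that the abstract reports.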
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM understanding of Persian taarof cultural norms
Addressing performance gaps in culturally appropriate communication
Developing benchmarks for culturally aware language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced TaarofBench benchmark with 450 culturally validated scenarios
Used supervised fine-tuning to improve cultural alignment by 21.8%
Applied Direct Preference Optimization for 42.3% better cultural compliance
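The Direct Preference Optimization step above trains the model to prefer the culturally appropriate response over an inappropriate one. A minimal sketch of the standard DPO objective for a single preference pair follows; the function name and the summed log-probability inputs are assumptions for illustration, not the paper's training code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of the culturally appropriate
    ("chosen") and inappropriate ("rejected") responses under the
    trainable policy and a frozen reference model.
    """
    # How much more (or less) the policy favors each response vs. the reference
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    # Bradley-Terry-style preference margin, scaled by beta
    margin = beta * (chosen_logratio - rejected_logratio)
    # Negative log-sigmoid: small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; the loss falls below that as the policy shifts probability mass toward the culturally appropriate response.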