🤖 AI Summary
This work addresses a limitation of existing natural language inference (NLI) benchmarks: they rely predominantly on Western contexts or translated data, and so inadequately assess models’ reasoning in Southeast Asian cultural settings. To bridge this gap, the authors introduce SEA-NLI, the first natively constructed multilingual NLI benchmark spanning eight countries, covering both English and local languages, and verified by native speakers with an emphasis on culture-specific knowledge. Evaluations across 17 encoder and decoder models reveal low performance in Southeast Asian cultural contexts, particularly in knowledge-intensive categories such as Languages and Science and Technology. SEA-adapted models and culture-aware prompting improve performance, while chain-of-thought (CoT) prompting yields only limited gains. The study exposes systematic shortcomings of current models in non-Western cultural reasoning and provides a benchmark for culturally grounded NLI research.
📝 Abstract
Frontier LLMs perform well in Western contexts but remain poorly tested on underrepresented cultures such as those in Southeast Asia (SEA). Existing NLI benchmarks are largely Western-centric, translation-derived, or monolingual, limiting their ability to measure culturally grounded reasoning. We introduce SEA-NLI, a native, culturally grounded NLI benchmark covering eight SEA countries in English and native regional languages, verified by native speakers. Across 17 encoder and decoder models, we observe low performance from all models, especially in knowledge-intensive categories such as Languages and Science and Technology. Our analysis shows that failure cases mainly stem from missing SEA cultural knowledge: SEA-adapted models and culture-aware prompting improve performance, while CoT prompting offers limited gains.