🤖 AI Summary
This study addresses a limitation of existing research, which reduces cultural intelligence in large language models to static knowledge acquisition and overlooks whether models can reason from cultural norms in realistic scenarios. To bridge this gap, the authors introduce CultureForest, a benchmark for verifiable evaluation of culturally grounded reasoning. Spanning 8 domains and 53 countries/regions, CultureForest grounds each question in a small set of atomic cultural norms, evaluates models progressively from multiple-choice to open-ended generation, and disentangles cultural knowledge from cultural reasoning to assess performance systematically. The findings reveal a substantial performance drop among state-of-the-art models in open-ended generation, a tendency toward conservative responses, and regional preference structures that are largely shared across models. Together, these results point to a need to shift evaluation from knowledge-centric to reasoning-oriented approaches.
📝 Abstract
Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by their ability to use that knowledge effectively. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.