🤖 AI Summary
This study systematically evaluates the cultural value alignment of 10 mainstream large language models (LLMs) across 20 national cultural contexts, aiming to uncover latent cultural biases and limitations in cross-cultural adaptability. Method: Grounded in Hofstede's and Schwartz's cultural value frameworks, the study constructs a multilingual benchmark with human ground-truth scores and proposes a cross-national, multi-model quantitative metric for cultural alignment, complemented by cross-lingual consistency analysis and attribution testing. Contribution/Results: (1) LLMs exhibit a "cultural middle-ground" tendency but consistently align more closely with U.S. cultural values than with those of their country of origin; (2) model origin, prompt language, and value dimensions interact significantly in shaping cultural output; (3) GLM-4 achieves the highest alignment performance, and all models align significantly better with U.S. values than with Chinese values. These findings provide an empirically grounded, reproducible evaluation protocol and practical insights for designing culturally adaptive LLMs.
📝 Abstract
LLMs are increasingly deployed as intelligent agents in scenarios involving human interaction, raising a critical concern about whether they faithfully reflect cultural variation across regions. Several works have investigated this question in various ways, finding biases in the cultural representations of LLM outputs. To gain a more comprehensive view, in this work we conduct the first large-scale evaluation of LLM cultural alignment, covering 20 countries' cultures and languages across ten LLMs. Using a renowned cultural values questionnaire and carefully comparing LLM outputs against human ground-truth scores, we thoroughly study LLMs' cultural alignment across countries and among individual models. Our findings show that the outputs of all models converge toward a moderate cultural middle ground. Given this overall skew, we propose an alignment metric, which reveals that the United States is the best-aligned country and that GLM-4 aligns best with cultural values. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output. Specifically, models, regardless of where they originate, align better with the US than with China. These conclusions provide insight into how LLMs can be better aligned with various cultures, and they provoke further discussion of the potential for LLMs to propagate cultural bias and the need for more culturally adaptable models.
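The abstract does not spell out how the proposed alignment metric is computed. Purely as a rough illustration, the minimal Python sketch below assumes each country is described by Hofstede-style dimension scores on a 0-100 scale and scores alignment as one minus the normalized mean absolute distance between model-derived and human ground-truth scores; the function name, dimension abbreviations, and all numbers are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: NOT the paper's actual metric.
# Assumes Hofstede-style cultural dimension scores on a 0-100 scale.

from typing import Dict

# Six Hofstede dimensions: power distance, individualism, masculinity,
# uncertainty avoidance, long-term orientation, indulgence.
HOFSTEDE_DIMS = ["pdi", "idv", "mas", "uai", "lto", "ivr"]

def alignment_score(model_scores: Dict[str, float],
                    human_scores: Dict[str, float],
                    scale: float = 100.0) -> float:
    """Return a score in [0, 1]; 1.0 means the model's dimension scores
    exactly match the human ground truth for that country."""
    diffs = [abs(model_scores[d] - human_scores[d]) / scale for d in HOFSTEDE_DIMS]
    return 1.0 - sum(diffs) / len(diffs)

# Hypothetical example values (illustrative numbers only).
human_us = {"pdi": 40, "idv": 91, "mas": 62, "uai": 46, "lto": 26, "ivr": 68}
model_us = {"pdi": 48, "idv": 75, "mas": 55, "uai": 52, "lto": 40, "ivr": 60}

print(f"US alignment: {alignment_score(model_us, human_us):.3f}")
```

Under this simple formulation, ranking countries or models amounts to comparing their alignment scores; the paper's actual metric and cross-lingual consistency analysis may differ.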