🤖 AI Summary
Current large language models (LLMs) predominantly encode Western mainstream cultural narratives, limiting their alignment with the values and commonsense knowledge of diverse U.S. populations, particularly marginalized communities. Existing national-level alignment benchmarks (e.g., KorNAT) lack granular, community-level representativeness. To address this, we introduce CIVIQ, the first cultural intelligence evaluation benchmark explicitly designed to assess LLM alignment with community-level social values and culturally situated commonsense reasoning. Methodologically, CIVIQ moves beyond nation-scale abstractions by integrating qualitative ethnographic research and social computing into a cross-cultural transfer framework, employing localized data collection and culturally sensitive annotation to build a multiracial, intergenerational, and geographically diverse evaluation dataset. CIVIQ provides a reusable, methodologically grounded toolkit for developing, evaluating, and iteratively refining culturally aware LLMs, thereby advancing concrete, practice-oriented progress in AI fairness and inclusion.
📝 Abstract
Large language models (LLMs) have emerged as a powerful technology and have seen widespread adoption and use by software engineering teams. Most often, LLMs are designed as "general purpose" technologies meant to represent the general population. Unfortunately, this often means alignment with predominantly Western Caucasian narratives and misalignment with other cultures and populations that engage in collaborative innovation. In response to this misalignment, there have been recent efforts centered on the development of "culturally-informed" LLMs, such as ChatBlackGPT, that are capable of better aligning with historically marginalized experiences and perspectives. Despite this progress, there has been little effort aimed at supporting our ability to develop and evaluate culturally-informed LLMs. A recent effort proposed an approach for developing a national alignment benchmark that emphasizes alignment with national social values and common knowledge. However, given the range of cultural identities present in the United States (U.S.), a national alignment benchmark is an ineffective goal for broader representation. To help fill this gap in the U.S. context, we propose a replication study that translates the process used to develop KorNAT, a Korean national LLM alignment benchmark, to develop CIVIQ, a Cultural Intelligence and Values Inference Quality benchmark centered on alignment with community social values and common knowledge. Our work provides a critical foundation for research and development aimed at cultural alignment of AI technologies in practice.