🤖 AI Summary
This study addresses a dual gap in existing evaluation frameworks for sovereign large language models (LLMs): the lack of integrated assessment of socio-cultural alignment and technical security. We introduce the first multidimensional benchmark dataset unifying cultural alignment, low-resource language support, and safety robustness, alongside the first framework for jointly analyzing socio-cultural alignment and technical security. Methodologically, we integrate qualitative analysis of culturally embedded expressions with quantitative evaluation of safety compliance and adversarial robustness, enabling cross-dimensional empirical assessment. Results reveal that while sovereign LLMs exhibit baseline competence in low-resource languages, they suffer from weak cultural alignment and frequently compromise safety to achieve linguistic fluency, demonstrating a significant trade-off between cultural adaptation and security. Our work advances a more comprehensive, evidence-driven evaluation paradigm for sovereign AI systems and provides a foundational methodology for trustworthy AI governance.
📝 Abstract
Recent trends in LLM development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their own LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets for verifying two critical questions: (1) how well these models align with users' socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always live up to the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.