Scenario-based Probing and Steering Cultural Values in Large Language Models--Extended Version

📅 2026-06-09

🤖 AI Summary

This study addresses implicit cultural bias in large language models (LLMs) deployed across diverse societies: homogeneous training data often embeds dominant value systems, and standard questionnaire-based evaluations fail to uncover them. The authors reframe World Values Survey (WVS) questions as contextualized behavioral dilemmas and use token-level probability probing to reveal LLMs’ latent cultural representations along the two Inglehart–Welzel value axes. Using activation interventions, country-conditioned prompting, and hybrid steering strategies, they shift model values without retraining across three open-source LLMs and four target cultures, and find that steerability varies substantially. The findings show coupling between cultural dimensions: interventions on one dimension induce shifts in the other, mirroring correlations in human WVS data. This coupling limits axis-independent alignment, though general task performance is largely preserved.
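To make the idea of token-level probability probing concrete, here is a minimal Python sketch. It is not the authors' implementation: the logit values, the two-option dilemma format, and the coding of options as +1 (secular-rational) or -1 (traditional) are all assumptions chosen for illustration. The sketch renormalizes the model's next-token logits over the answer options and reports the expected pole value as an implicit score on one axis.

```python
import math

def option_probabilities(logits):
    """Softmax restricted to the answer-option tokens of one dilemma."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def axis_score(logits, pole_of_option):
    """Expected pole value in [-1, 1]; the sign shows which pole is favored."""
    probs = option_probabilities(logits)
    return sum(probs[tok] * pole_of_option[tok] for tok in probs)

# Hypothetical next-token logits for options "A" and "B" of one dilemma.
dilemma_logits = {"A": 2.0, "B": 0.0}
# Assumed coding: "A" is the secular-rational choice, "B" the traditional one.
poles = {"A": 1.0, "B": -1.0}

score = axis_score(dilemma_logits, poles)
print(f"implicit axis score: {score:.4f}")
```

Averaging such scores over many dilemmas per axis would give one coordinate per axis, which is one plausible way to place a model in the two-dimensional value space.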

📝 Abstract

Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart–Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, where interventions along one cultural dimension induce shifts along another. This coupling mirrors correlations in human WVS data and persists across activation, prompt, and hybrid steering. It constrains axis-independent alignment, though general task performance is largely preserved.
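The following Python sketch illustrates activation steering and how entanglement between axes could show up. It rests on assumptions not stated in the abstract: the steering direction is built as a normalized difference of mean activations (a common construction, not necessarily the one used in the paper), and the three-dimensional activations are toy values. Steering along the first axis also moves the hidden state along the second axis whenever the two directions are not orthogonal.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def steering_direction(pos_acts, neg_acts):
    """Unit vector from the mean 'negative pole' to the mean 'positive pole'."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Add the scaled steering direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Coordinate of the hidden state along a unit direction."""
    return sum(h * d for h, d in zip(hidden, direction))

# Toy activations (hypothetical) for the two poles of each value axis.
dir_axis1 = steering_direction([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
                               [[-1.0, 0.0, 0.0], [-3.0, 0.0, 0.0]])
dir_axis2 = steering_direction([[3.0, 4.0, 0.0]], [[0.0, 0.0, 0.0]])

hidden = [0.0, 0.0, 0.0]
steered = steer(hidden, dir_axis1, alpha=2.0)  # intervene on axis 1 only

shift_axis1 = projection(steered, dir_axis1) - projection(hidden, dir_axis1)
shift_axis2 = projection(steered, dir_axis2) - projection(hidden, dir_axis2)
print(f"shift on axis 1: {shift_axis1:.2f}, induced shift on axis 2: {shift_axis2:.2f}")
```

In this toy setup the induced shift on the second axis equals the steering strength times the cosine similarity of the two directions, which gives a simple geometric reading of why fully axis-independent steering may be hard to achieve.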

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Cultural Values

Value Alignment

World Values Survey

Implicit Preferences

Innovation

Methods, ideas, or system contributions that make the work stand out.

scenario-based probing

activation steering

cultural alignment