The Hidden Space of Safety: Understanding Preference-Tuned LLMs in Multilingual context

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a critical resource bias in the multilingual alignment of large language models (LLMs): current English-centric alignment methods fail to generalize to low-resource languages, resulting in weakened safety representations and a sharp degradation in alignment performance. To address this, the authors introduce a novel quantitative metric, the "safety space separation degree," and use it to systematically analyze distributional shifts across the embedding spaces of seven multilingual LLMs. Leveraging a balanced toxicity evaluation set and a parallel-text detoxification benchmark, they conduct cross-lingual comparative experiments. The results reveal significant embedding-space separation between high- and low-resource languages, demonstrating that a single alignment strategy cannot ensure fairness or robustness across languages. The study empirically validates the need for language-specific alignment fine-tuning, providing both theoretical grounding and empirical evidence for building equitable and reliable multilingual safety alignment frameworks.
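The paper's exact formulation of the safety space separation degree is not reproduced in this card. Purely as an illustration of the idea, the sketch below scores how far safe and toxic sentence embeddings sit apart relative to their internal spread (a Fisher-style ratio); the function name separation_degree, the language codes, and the random embeddings are hypothetical stand-ins for real model activations.

```python
import numpy as np

def separation_degree(safe_emb: np.ndarray, toxic_emb: np.ndarray) -> float:
    """Fisher-style separation between safe and toxic embedding clusters.

    safe_emb, toxic_emb: (n_samples, hidden_dim) arrays of sentence
    embeddings taken from the same model layer. Higher values mean the
    two classes sit further apart relative to their internal spread.
    """
    mu_s, mu_t = safe_emb.mean(axis=0), toxic_emb.mean(axis=0)
    between = np.linalg.norm(mu_s - mu_t) ** 2                          # between-class distance
    within = safe_emb.var(axis=0).sum() + toxic_emb.var(axis=0).sum()   # within-class spread
    return float(between / (within + 1e-12))

# Toy usage: synthetic embeddings standing in for a balanced toxicity set per language.
rng = np.random.default_rng(0)
langs = {"en": 1.5, "de": 1.2, "am": 0.3}  # toy class-mean gap per language, not real data
for lang, gap in langs.items():
    safe = rng.normal(0.0, 1.0, size=(200, 768))
    toxic = rng.normal(gap, 1.0, size=(200, 768))
    print(f"{lang}: separation degree = {separation_degree(safe, toxic):.3f}")
```

Under such a score, high-resource languages would be expected to show larger separation after alignment than low-resource ones, which is the disparity the paper reports.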

📝 Abstract
Alignment tuning has enabled large language models to excel in reasoning, instruction-following, and minimizing harmful generations. However, despite their widespread deployment, these models exhibit a monolingual bias, raising concerns about the effectiveness of alignment across languages. Current alignment methods predominantly focus on English, leaving it unclear how alignment mechanisms generalize to multilingual settings. To address this, we conduct a systematic analysis of distributional shifts in the embedding space of LLMs before and after alignment, uncovering the impact of alignment on model behavior across diverse languages. We leverage the alignment-induced separation in safety space as a quantitative tool to measure how alignment enforces safety constraints. Our study evaluates seven LLMs using balanced toxicity datasets and parallel text-detoxification benchmarks, revealing substantial disparities in the latent representation space between high-resource and low-resource languages. These findings underscore the need for language-specific fine-tuning to ensure fair, reliable, and robust multilingual alignment. Our insights provide a foundation for developing truly safe multilingual LLMs, emphasizing the urgency of addressing alignment gaps in underrepresented languages.
Problem

Research questions and friction points this paper is trying to address.

Monolingual bias in preference-tuned LLMs across languages
Generalization gaps of alignment methods in multilingual settings
Disparities in safety-constraint enforcement for low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing embedding-space shifts before and after alignment (a rough sketch of one way to quantify this follows the list below)
Using safety space separation as a measurement tool
Evaluating LLMs on balanced toxicity datasets and parallel text-detoxification benchmarks
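The card does not include the authors' code for measuring pre- versus post-alignment shifts. As a minimal sketch, assuming per-prompt embeddings from the base and the aligned checkpoints are already available as numpy arrays, one could compare the average cosine displacement per language; the function name mean_embedding_shift, the language codes, and the toy numbers below are hypothetical stand-ins.

```python
import numpy as np

def mean_embedding_shift(base: np.ndarray, aligned: np.ndarray) -> float:
    """Average cosine distance between each prompt's embedding before and
    after alignment tuning. base/aligned: (n_prompts, hidden_dim) arrays
    for the same prompts taken from the same layer of both checkpoints."""
    b = base / np.linalg.norm(base, axis=1, keepdims=True)
    a = aligned / np.linalg.norm(aligned, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(b * a, axis=1)))

# Toy usage with synthetic embeddings standing in for parallel toxic prompts:
rng = np.random.default_rng(1)
base_en = rng.normal(size=(100, 768))
aligned_en = base_en + rng.normal(0.8, 0.1, size=(100, 768))   # alignment moves English representations a lot
base_sw = rng.normal(size=(100, 768))
aligned_sw = base_sw + rng.normal(0.05, 0.1, size=(100, 768))  # barely moves a low-resource language
print("en shift:", round(mean_embedding_shift(base_en, aligned_en), 3))
print("sw shift:", round(mean_embedding_shift(base_sw, aligned_sw), 3))
```

A cosine-based shift is only one plausible choice; the paper's actual analysis relies on its separation-degree metric over safe versus toxic clusters, so any real reproduction should follow the definitions given in the paper.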