Low-Resource Safety Failures Are Action Failures, Not Representation Failures

📅 2026-05-31

🤖 AI Summary
This work addresses the degradation of safety alignment when large language models are applied to low-resource languages. It argues that the root cause is not a missing representation but miscalibration of the safety decision: the harmfulness direction extracted from high-resource activations still linearly separates harmful from harmless low-resource prompts, yet the model fails to convert that signal into refusal. The proposed lightweight intervention is a low-rank logistic readout whose decision threshold is recalibrated using as few as 1 to 4 target-language examples per class, with no model retraining. The resulting gate routes between refusal steering and harmfulness-direction ablation, raising mean refusal selectivity from 33.6 (the strongest adapted baseline) to 54.5 across 23 languages while preserving MMLU utility. The results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones.
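To make the "representation is present" claim concrete, here is a minimal sketch of a linear separability check. It is an illustration under stated assumptions, not the paper's procedure: the difference-of-means estimator for the harmfulness direction, the synthetic Gaussian "activations", and every function name below are hypothetical. The toy target-language data is deliberately shifted along the direction, so the classes stay separable even though their absolute scores move.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def harm_direction(harmful, harmless):
    """Unit-norm difference of class means (an assumed estimator)."""
    mu_harmful, mu_harmless = mean_vector(harmful), mean_vector(harmless)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def project(rows, direction):
    """Scalar projection of each vector onto the direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def synthetic(n, dim, shift, rng):
    """Toy activations: unit Gaussian noise plus a shift along axis 0."""
    return [[rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
# Hypothetical high-resource activations, used only to estimate the direction.
high_harmful = synthetic(200, 8, 3.0, rng)
high_harmless = synthetic(200, 8, 0.0, rng)
# Hypothetical low-resource activations: same class gap, but offset downward.
low_harmful = synthetic(200, 8, 1.0, rng)
low_harmless = synthetic(200, 8, -2.0, rng)

direction = harm_direction(high_harmful, high_harmless)
auroc_high = auroc(project(high_harmful, direction), project(high_harmless, direction))
auroc_low = auroc(project(low_harmful, direction), project(low_harmless, direction))
print(f"AUROC high-resource: {auroc_high:.3f}, low-resource: {auroc_low:.3f}")
```

In this toy setup the direction transfers because separability depends only on the gap between classes, not on where the scores sit in absolute terms. Any fixed decision threshold, by contrast, depends on exactly that absolute position.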
📝 Abstract
Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($\Delta$ = harmful refusal $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.
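The recalibration idea and the selectivity metric can be sketched on toy data as follows. This is a simplified illustration, not the released implementation: the scalar logits stand in for the output of the low-rank logistic readout, the midpoint-of-class-means rule for resetting the threshold is an assumption, `source_threshold` is a hypothetical value, and every prompt routed to refusal steering is counted as refused.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def recalibrated_threshold(harmful_logits, harmless_logits):
    """Midpoint of the few-shot class means (assumed rule, not taken from the paper)."""
    mean_harmful = sum(harmful_logits) / len(harmful_logits)
    mean_harmless = sum(harmless_logits) / len(harmless_logits)
    return 0.5 * (mean_harmful + mean_harmless)

def route(logit, threshold):
    """Gate decision: refusal steering if flagged harmful, else direction ablation."""
    flagged = sigmoid(logit - threshold) >= 0.5
    return "refusal_steering" if flagged else "direction_ablation"

def selectivity(harmful_logits, harmless_logits, threshold):
    """Delta = harmful refusal rate - harmless refusal rate, in percentage points."""
    def refusal_rate(logits):
        refused = sum(route(z, threshold) == "refusal_steering" for z in logits)
        return 100.0 * refused / len(logits)
    return refusal_rate(harmful_logits) - refusal_rate(harmless_logits)

rng = random.Random(0)
# Toy target-language readout logits: separable, but shifted below the source threshold.
target_harmful = [rng.gauss(1.0, 1.0) for _ in range(500)]
target_harmless = [rng.gauss(-2.0, 1.0) for _ in range(500)]

source_threshold = 1.5  # hypothetical threshold tuned on high-resource data
k = 4                   # labeled target-language examples per class
new_threshold = recalibrated_threshold(target_harmful[:k], target_harmless[:k])

# Evaluate on the examples not used for recalibration.
delta_before = selectivity(target_harmful[k:], target_harmless[k:], source_threshold)
delta_after = selectivity(target_harmful[k:], target_harmless[k:], new_threshold)
print(f"selectivity before: {delta_before:.1f}, after: {delta_after:.1f}")
```

Only one scalar, the threshold, is re-estimated from target-language data, which is why a handful of examples per class can suffice in this sketch. The readout weights and the underlying model are left untouched.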
Problem

Research questions and friction points this paper is trying to address.

low-resource languages
safety alignment
representation transfer
refusal calibration
cross-lingual safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

safety alignment
low-resource languages
representation calibration
adaptive steering
refusal selectivity