Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the overlooked vulnerability of multilingual multimodal large language models (MLLMs) in low-resource languages, focusing on adversarial robustness and genuine safety alignment. It systematically evaluates open-source models across twelve languages using gradient-based adversarial attacks, multilingual harmful instructions delivered both as text and as typographic content embedded in images, analysis of the vision encoder, and cross-lingual transfer tests. The work uncovers a “safety-by-failure” phenomenon: models that acquire multilingual capability only through instruction tuning appear safe in lower-resource languages because they fail to comprehend the text prompt or to parse non-English script in the image, not because they actively refuse. In contrast, models that build multilingual capability throughout their training stages, such as Qwen3-VL, maintain active refusal across languages. These findings suggest that deep multilingual integration across training stages, rather than shallow adaptation such as fine-tuning on translated instruction data, is needed for genuine safety alignment across languages.
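
To make the distinction between apparent and genuine safety concrete, the following is a minimal sketch of how per-language outcomes could be broken down. It is not the paper's evaluation code: the three outcome labels, the function `safety_breakdown`, the language keys, and the toy data are all hypothetical and serve only to show why a low unsafe-response rate alone cannot separate active refusal from comprehension failure.

```python
from collections import Counter

# Hypothetical outcome labels for one model response to a harmful prompt.
COMPLIED = "complied"              # the model followed the harmful instruction
REFUSED = "refused"                # the model understood and actively refused
NOT_UNDERSTOOD = "not_understood"  # the model failed to read or comprehend the prompt


def safety_breakdown(outcomes):
    """Split apparent safety into genuine refusal and comprehension failure.

    `outcomes` maps a language key to a list of outcome labels.
    Languages with no recorded outcomes are skipped.
    """
    report = {}
    for language, labels in outcomes.items():
        total = len(labels)
        if total == 0:
            continue
        counts = Counter(labels)
        report[language] = {
            # Share of responses that are not unsafe, whatever the reason.
            "apparent_safety": 1.0 - counts[COMPLIED] / total,
            # Share of responses where the model actively refused.
            "genuine_refusal": counts[REFUSED] / total,
            # Share of responses that look safe only because the prompt was not understood.
            "safety_by_failure": counts[NOT_UNDERSTOOD] / total,
        }
    return report


# Toy, made-up labels (not results from the paper).
toy_outcomes = {
    "high_resource": [COMPLIED, COMPLIED, REFUSED, REFUSED],
    "low_resource": [NOT_UNDERSTOOD, NOT_UNDERSTOOD, NOT_UNDERSTOOD, REFUSED],
}
report = safety_breakdown(toy_outcomes)

for language, rates in report.items():
    print(language, rates)
```

In this toy data the low-resource language has the higher apparent safety, yet most of it comes from the `not_understood` bucket rather than from refusals, which is the pattern the summary describes.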

📝 Abstract

Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire multilingual capability through instruction tuning. Gradient-based attacks reveal a transferable multilingual vulnerability: adversarial images optimized in one language continue to induce failure in others, demonstrating strong cross-lingual transferability. Multilingual safety further varies with how effectively a model retrieves or interprets harmful instructions. When harmful intent is issued through text, languages with stronger linguistic grounding more often elicit misuse-enabling responses, while weaker languages produce fewer unsafe outputs. When embedded in the image as typographic content, English scripts are reliably recognised and followed, whereas non-English scripts are rarely parsed by the vision encoder. Lower-resource languages may therefore appear safer, but this is an artefact of comprehension and visual-grounding failures rather than genuine alignment, a phenomenon we term safety-by-failure. In contrast, MLLMs that build multilingual capability throughout their training stages rather than only at instruction tuning, such as Qwen3-VL, exhibit genuine cross-lingual safety, maintaining active refusal across languages rather than masking comprehension failure. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, may produce surface-level understanding that creates illusory safety in low-resource languages; deeper integration across training stages leads to genuine multilingual safety alignment.
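
The abstract does not name the specific gradient-based attack, so the sketch below uses projected gradient descent (PGD) under an L-infinity budget as one common choice, purely for illustration. A toy logistic scorer stands in for the model; in a real setting the gradient would be backpropagated through the MLLM's vision encoder and language model. The second scorer is a hypothetical stand-in for the same model queried in another language, used only to show how a transfer check is structured. All weights, inputs, and hyperparameters are made up.

```python
import math


def predict(weights, bias, pixels):
    """Probability that the toy logistic scorer gives the correct answer."""
    logit = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def loss_gradient(weights, bias, pixels, label):
    """Gradient of binary cross-entropy w.r.t. the input: (p - y) * w."""
    p = predict(weights, bias, pixels)
    return [(p - label) * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def pgd_attack(weights, bias, pixels, label, epsilon, step, iterations):
    """L-infinity PGD: ascend the loss, then project back into the budget."""
    adversarial = list(pixels)
    for _ in range(iterations):
        gradient = loss_gradient(weights, bias, adversarial, label)
        for i, g in enumerate(gradient):
            moved = adversarial[i] + step * sign(g)
            # Project onto the epsilon-ball around the clean input.
            moved = min(max(moved, pixels[i] - epsilon), pixels[i] + epsilon)
            # Keep the value in the valid pixel range [0, 1].
            adversarial[i] = min(max(moved, 0.0), 1.0)
    return adversarial


# Toy setup (hypothetical values, not from the paper).
source_w, source_b = [2.0, -1.0, 1.5, -0.5], 0.0  # scorer used to craft the attack
target_w, target_b = [1.8, -0.8, 1.2, -0.7], 0.1  # stand-in for another language
clean = [0.6, 0.4, 0.5, 0.5]
epsilon = 0.3

adversarial = pgd_attack(source_w, source_b, clean, label=1,
                         epsilon=epsilon, step=0.05, iterations=20)

print("source clean:", predict(source_w, source_b, clean))
print("source adversarial:", predict(source_w, source_b, adversarial))
print("target adversarial:", predict(target_w, target_b, adversarial))
```

The perturbation is optimized only against the source scorer, yet it also pushes the target scorer below 0.5 because the two toy scorers have similar weights. This mirrors the structure of a cross-lingual transfer test, not its empirical outcome.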

Problem

Research questions and friction points this paper is trying to address.

adversarial robustness

safety alignment

multilingual

multimodal large language models

cross-lingual transferability

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual multimodal LLMs

adversarial robustness

safety alignment