The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the multilingual jailbreaking vulnerabilities of closed-source large language models, including GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max, in both Chinese and English settings. To address the lack of standardized cross-lingual adversarial evaluation, we propose the first integrated, cross-lingual, multi-model, multi-attack benchmark framework, incorporating 32 jailbreak techniques across six safety-sensitive content categories and using Attack Success Rate (ASR) as the primary metric. Across 38,400 model responses, Qwen-Max proves the most vulnerable and GPT-4o the most robust. We also find that Chinese prompts exhibit significantly higher adversarial potency than their English counterparts, a previously unreported phenomenon. Furthermore, we introduce a novel Two-Sides attack strategy that achieves an average ASR improvement of 12.7% across all models, making it the most effective cross-model jailbreaking technique evaluated. Our findings underscore the need for language-aware alignment and cross-lingual collaborative defense mechanisms.

📝 Abstract
Large language models (LLMs) have seen widespread application across various domains, yet remain vulnerable to adversarial prompt injection. While most existing research on jailbreak attacks and hallucination phenomena has focused primarily on open-source models, we investigate the frontier of closed-source LLMs under multilingual attack scenarios. We present a first-of-its-kind integrated adversarial framework that leverages diverse attack techniques to systematically evaluate frontier proprietary models, including GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max. Our evaluation spans six categories of safety-sensitive content in both English and Chinese, generating 38,400 responses across 32 types of jailbreak attacks. Attack success rate (ASR) serves as the quantitative metric for assessing performance along three dimensions: prompt design, model architecture, and language environment. Our findings suggest that Qwen-Max is the most vulnerable, while GPT-4o shows the strongest defense. Notably, prompts in Chinese consistently yield higher ASRs than their English counterparts, and our novel Two-Sides attack technique proves to be the most effective across all models. This work highlights an urgent need for language-aware alignment and robust cross-lingual defenses in LLMs, and we hope it will inspire researchers, developers, and policymakers toward more robust and inclusive AI systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating multilingual jailbreak attacks on closed-source LLMs
Assessing vulnerability of proprietary models like GPT-4o and Qwen-Max
Analyzing attack success rates across languages and prompt designs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrated adversarial framework for closed-source LLMs
Multilingual attack scenarios in English and Chinese
Two-Sides attack technique for higher success rates
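The paper's primary metric, Attack Success Rate, can be sketched as the fraction of jailbreak attempts judged successful, broken down per model and per language. The record fields and the toy data below are illustrative assumptions, not the authors' actual pipeline:

```python
# Hedged sketch of the ASR metric described above.
# Record fields ("model", "language", "success") are hypothetical names,
# not taken from the paper's implementation.
from collections import defaultdict

def attack_success_rate(records):
    """Return ASR per (model, language) pair as a fraction in [0, 1].

    records: iterable of dicts like
        {"model": "GPT-4o", "language": "zh", "success": True}
    where "success" means a judge deemed the jailbreak successful.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for r in records:
        key = (r["model"], r["language"])
        totals[key] += 1
        if r["success"]:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}

# Illustrative data: four judged responses (not real results).
records = [
    {"model": "GPT-4o", "language": "en", "success": False},
    {"model": "GPT-4o", "language": "zh", "success": True},
    {"model": "Qwen-Max", "language": "zh", "success": True},
    {"model": "Qwen-Max", "language": "zh", "success": True},
]
asr = attack_success_rate(records)
print(asr)  # e.g. ASR of 1.0 for ("Qwen-Max", "zh")
```

Grouping by (model, language) mirrors the paper's three evaluation dimensions; a third grouping key for prompt design would extend the same pattern.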
👥 Authors
Linghan Huang, The University of Sydney (Trustworthy ML, Software Security)
Haolin Jin, The University of Sydney
Zhaoge Bi, School of Electrical and Computer Engineering, University of Sydney
Pengyue Yang, School of Electrical and Computer Engineering, University of Sydney
Peizhou Zhao, School of Electrical and Computer Engineering, University of Sydney
Taozhao Chen, School of Electrical and Computer Engineering, University of Sydney
Xiongfei Wu, The University of Tokyo, Japan
Lei Ma, The University of Tokyo, Japan
Huaming Chen, The University of Sydney (Trustworthy ML, Applied Machine Learning, Data Mining, Service Computing)