TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses a gap in large language model (LLM) safety evaluation, which relies predominantly on English and overlooks risks in low-resource languages such as those spoken in Africa. The authors introduce the first jailbreak benchmark covering seven African languages and examine how language, cultural context, and prompting strategy affect model safety across four experimental settings: human translation, culturally adapted prompts, human-curated prompts validated through model interaction, and code-switching. The evaluation uncovers two structural limitations in low-resource settings: degraded model comprehension and reduced reliability of automated safety judgments. To capture the first, the authors add a third response category, “Deflection,” alongside Refused and Jailbroken. Results show that prompts in African languages lower refusal rates relative to English, with culturally adapted prompts producing the least refusal. LLM-as-a-judge also agrees less with human annotators in lower-resource languages and less commonly supported scripts.

📝 Abstract

Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: (1) human translation of JBB prompts, (2) English adaptation to African contexts followed by human translation, (3) human-curated prompts validated through interactions with GPT-5.2, and (4) code-switched prompts combining English and African languages. Together, these settings isolate the effects of language, cultural grounding, and prompt evasiveness on model safety. Across closed and open models, prompting in African languages reduces refusal relative to English, with culturally adapted prompts leading to the least refusal. The evaluation also surfaces two structural limitations: model comprehension failures and reduced LLM-as-a-judge reliability in LRLs. To capture the first, we introduce Deflection alongside Refused and Jailbroken; to assess the second, we validate outputs with human annotations, showing that judge-human agreement drops in lower-resource languages and less commonly supported scripts.
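
The three-way labelling (Refused, Jailbroken, Deflection) and the judge-versus-human comparison described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the record layout, and the choice of Cohen's kappa as the agreement statistic are hypothetical, since the abstract does not state which agreement metric the authors use.

```python
from collections import Counter, defaultdict

# The three response labels named in the abstract.
LABELS = ("Refused", "Jailbroken", "Deflection")


def label_rates(records):
    """Return the share of each label per language.

    `records` is an iterable of (language, label) pairs. This layout is
    hypothetical; it is not the benchmark's actual data format.
    """
    counts = defaultdict(Counter)
    for language, label in records:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        counts[language][label] += 1
    return {
        language: {label: c[label] / sum(c.values()) for label in LABELS}
        for language, c in counts.items()
    }


def cohen_kappa(judge, human):
    """Chance-corrected agreement between two equal-length label lists.

    Cohen's kappa is an assumed stand-in for the paper's agreement measure.
    """
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

With labels of this shape, a refusal rate is `label_rates(records)[language]["Refused"]`, and computing `cohen_kappa` separately per language would show whether judge reliability varies by language, which is the comparison the abstract reports.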

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Safety Evaluation

Low-Resource Languages

African Languages

Jailbreak

Innovation

Methods, ideas, or system contributions that make the work stand out.

culturally grounded benchmark

low-resource languages

jailbreak evaluation