Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin

📅 2026-03-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses systematic blind spots that global safety models show when handling culturally specific risks in Taiwanese Mandarin, such as localized financial scams, culturally embedded hate speech, and misinformation. Its central hypothesis is that effective safety detection relies on cultural and sociolinguistic knowledge already present in the base model, and that safety fine-tuning alone cannot introduce such knowledge from scratch. The authors construct TS-Bench, a 400-prompt benchmark for evaluating safety performance in Taiwanese Mandarin, and develop Breeze Guard, an 8B-parameter safety model derived from Breeze 2, a general-purpose Taiwanese Mandarin LLM, through supervised fine-tuning on human-verified synthetic data. On TS-Bench, Breeze Guard scores 0.17 higher in overall F1 than Granite Guardian 3.3, with the largest gains in scam (+0.66 F1) and financial malpractice (+0.43 F1) detection. It performs slightly worse on English-centric benchmarks (ToxicChat, AegisSafetyTest), a tradeoff the authors consider expected for a regionally specialized model.

📝 Abstract

Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks such as localized financial scams, culturally embedded hate speech, and misinformation patterns. To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin. TS-Bench contains 400 human-curated prompts spanning critical domains including financial fraud, medical misinformation, social discrimination, and political manipulation. In parallel, we present Breeze Guard, an 8B safety model derived from Breeze 2, our previously released general-purpose Taiwanese Mandarin LLM with strong cultural grounding from its original pre-training corpus. Breeze Guard is obtained through supervised fine-tuning on a large-scale, human-verified synthesized dataset targeting Taiwan-specific harms. Our central hypothesis is that effective safety detection requires the cultural grounding already present in the base model; safety fine-tuning alone is insufficient to introduce new sociolinguistic knowledge from scratch. Empirically, Breeze Guard significantly outperforms the leading 8B general-purpose safety model, Granite Guardian 3.3, on TS-Bench (+0.17 overall F1), with particularly large gains in high-context categories such as scam (+0.66 F1) and financial malpractice (+0.43 F1). While the model shows slightly lower performance on English-centric benchmarks (ToxicChat, AegisSafetyTest), this tradeoff is expected for a regionally specialized safety model optimized for Taiwanese Mandarin. Together, Breeze Guard and TS-Bench establish a new foundation for trustworthy AI deployment in Taiwan.
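
The per-category and overall F1 figures quoted above can be read against a minimal sketch of how such scores are typically computed for a binary safe/unsafe classifier. Everything below is an assumption for illustration: the abstract does not state the label scheme, whether the overall F1 is pooled or averaged across categories, or any record format. The function names and toy records are hypothetical and are not TS-Bench data.

```python
from collections import defaultdict


def f1_score(gold, pred):
    """Binary F1 where label 1 means 'unsafe' (the positive class)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_category(records):
    """records: list of (category, gold_label, predicted_label) tuples.

    Returns (per_category_scores, pooled_overall_score). Pooling all
    examples for the overall score is an assumption, not the paper's
    stated protocol.
    """
    grouped = defaultdict(lambda: ([], []))
    for category, gold, pred in records:
        grouped[category][0].append(gold)
        grouped[category][1].append(pred)
    per_category = {c: f1_score(g, p) for c, (g, p) in grouped.items()}
    overall = f1_score([g for _, g, _ in records], [p for _, _, p in records])
    return per_category, overall


# Toy records, invented purely to exercise the functions.
toy_records = [
    ("scam", 1, 1), ("scam", 1, 0), ("scam", 0, 0), ("scam", 1, 1),
    ("medical", 1, 1), ("medical", 0, 1), ("medical", 0, 0),
]
per_category, overall = f1_by_category(toy_records)
```

Under this reading, a gain such as "+0.66 F1" on scam is the difference between two models' values of `per_category["scam"]` computed over the same prompts.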

Problem

Research questions and friction points this paper is trying to address.

Taiwanese Mandarin

AI safety

cultural nuance

region-specific risks

trustworthy AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

Taiwanese Mandarin

cultural grounding

safety benchmark