ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
The capability of existing large language models (LLMs) to detect illegal content in Chinese contexts, particularly politically sensitive material, pornography, and variant/homophonic words used for evasion, remains poorly characterized. Method: The authors introduce ChineseSafe, an LLM safety benchmark aligned with Chinese Internet content-moderation regulations, comprising 205,034 samples across 4 classes and 10 sub-classes of safety issues. Beyond common safety categories, it adds content types specific to the Chinese setting: political sensitivity, pornography, and variant/homophonic words. Two evaluation methods are used to measure the legal risks of popular LLMs, covering both open-source models and commercial APIs. Contribution/Results: Experiments show that many mainstream LLMs are vulnerable to certain categories, notably political sensitivity and homophone-based evasion, exposing legal risks under Chinese regulations. The benchmark results are publicly available on a Hugging Face space.
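
The homophone-based evasion mentioned above is easiest to see with a toy example: exact-match filters can be bypassed by swapping a character for one that sounds the same. The sketch below is purely illustrative; the substitution table is hand-written for this example and is not the paper's construction procedure for variant words.

```python
# Toy illustration of homophone-based (变体字/谐音) evasion, the kind of
# variant-word perturbation ChineseSafe targets. The table below is a
# hand-picked example, not the benchmark's actual construction method.
HOMOPHONE_TABLE = {
    "赌": ["堵", "睹"],  # du3: "gamble" -> same-sounding "block" / "witness"
    "毒": ["独", "读"],  # du2: "drug"   -> same-sounding "alone" / "read"
}

def perturb(text: str, table: dict) -> str:
    """Replace each character with its first listed homophone, if any."""
    return "".join(table.get(ch, [ch])[0] for ch in text)

# "赌博" (gambling) becomes "堵博": it reads the same aloud, but an
# exact-string keyword filter no longer matches it.
print(perturb("赌博", HOMOPHONE_TABLE))
```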

📝 Abstract
With the rapid development of large language models (LLMs), understanding the capabilities of LLMs in identifying unsafe content has become increasingly important. While previous works have introduced several benchmarks to evaluate the safety risks of LLMs, the community still has a limited understanding of current LLMs' capability to recognize illegal and unsafe content in Chinese contexts. In this work, we present a Chinese safety benchmark (ChineseSafe) to facilitate research on the content safety of large language models. To align with the regulations for Chinese Internet content moderation, our ChineseSafe contains 205,034 examples across 4 classes and 10 sub-classes of safety issues. For Chinese contexts, we add several special types of illegal content: political sensitivity, pornography, and variant/homophonic words. Moreover, we employ two methods to evaluate the legal risks of popular LLMs, including open-sourced models and APIs. The results reveal that many LLMs exhibit vulnerability to certain types of safety issues, leading to legal risks in China. Our work provides a guideline for developers and researchers to facilitate the safety of LLMs. Our results are also available at https://huggingface.co/spaces/SUSTech/ChineseSafe-Benchmark.
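
As a rough illustration of how such an evaluation could be run against an open-source model, the sketch below loads the benchmark from Hugging Face and asks the model to label each sample with a zero-shot prompt. The dataset ID, split, field names ("text", "label"), label values, and the chosen model are all assumptions made for illustration and are not confirmed by the paper; consult the linked Hugging Face space for the released artifacts.

```python
# Minimal sketch of a zero-shot safety-classification probe, in the spirit of
# the paper's evaluation of open-source models. Dataset ID, split, field names,
# and label values are assumptions, not details confirmed by the paper.
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("SUSTech/ChineseSafe", split="test")  # hypothetical dataset ID
generator = pipeline("text-generation", model="Qwen/Qwen2-1.5B-Instruct")

# Ask the model to answer only "安全" (safe) or "不安全" (unsafe).
PROMPT = "判断以下内容是否安全，只回答“安全”或“不安全”。\n内容：{text}\n回答："

def predict(example: dict) -> str:
    out = generator(PROMPT.format(text=example["text"]),
                    max_new_tokens=8, return_full_text=False)
    return "unsafe" if "不安全" in out[0]["generated_text"] else "safe"

# Assumes the gold label field stores "safe" / "unsafe" strings (hypothetical).
subset = dataset.select(range(100))
accuracy = sum(predict(ex) == ex["label"] for ex in subset) / len(subset)
print(f"accuracy on 100 samples: {accuracy:.2%}")
```

A commercial API could be probed the same way by replacing the local pipeline call with an HTTP request to the provider.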
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' ability to identify unsafe Chinese content
Assesses the legal risks LLMs pose in Chinese contexts
Provides a benchmark aligned with Chinese Internet content-moderation regulations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chinese safety benchmark with 205,034 examples across 4 classes and 10 sub-classes
Covers China-specific content types: political sensitivity, pornography, and variant/homophonic words
Evaluates the legal risks of open-source LLMs and commercial APIs in China
👥 Authors
Hengxiang Zhang
Department of Statistics and Data Science, Southern University of Science and Technology
Hongfu Gao
National University of Singapore
Reliable Machine Learning, Natural Language Processing
Qiang Hu
Department of Statistics and Data Science, Southern University of Science and Technology
Guanhua Chen
Department of Statistics and Data Science, Southern University of Science and Technology
Lili Yang
Department of Statistics and Data Science, Southern University of Science and Technology
Bingyi Jing
Chair Professor, Southern University of Science & Technology
Statistics, Data Science, AI
Hongxin Wei
Southern University of Science and Technology (SUSTech)
Reliable Machine Learning, Uncertainty Estimation, Statistics
Bing Wang
Deepexi Technology Co. Ltd.
Haifeng Bai
Deepexi Technology Co. Ltd.
Lei Yang
Deepexi Technology Co. Ltd.