Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

📅 2026-05-28

🤖 AI Summary

This work addresses real-time detection of unsafe prompts, toxic language, jailbreak attempts, and unsafe responses in large language model (LLM) applications by proposing Opir, a family of lightweight encoder-based guardrail models built on the GLiClass architecture. The multi-task models support binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot categorization of unsafe prompts and responses; separate edge variants with fewer than 100M parameters handle binary safe/unsafe classification only. Training uses a three-level taxonomy of 996 categories (16 top-level, 126 mid-level, and 854 leaf labels) and combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. In a comparison covering 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems, Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets, with a substantially smaller deployment footprint. The authors also release an open-source evaluation harness that supports GLiClass, GLiNER2, and decoder-based backends.

📝 Abstract

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign but sensitive text from genuinely harmful content, including covert cases. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot categorization of unsafe prompts and responses. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe classification. The models are trained on a three-level taxonomy containing 996 categories: 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-source an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and that covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems, including both GLiNER2-based and generative guardrail models, Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.
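
The three-level taxonomy described above can be pictured with a small sketch. The abstract does not say how leaf-level predictions relate to mid-level labels, top-level labels, or the binary safe/unsafe verdict, so the roll-up rule below (a flagged label also flags its ancestors, and any flagged label means unsafe) is an assumption made for illustration. The label names, the `NODES` table, and the `roll_up` helper are hypothetical and are not taken from the paper or from the GLiClass library; only the level sizes come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Level sizes as stated in the abstract: 16 + 126 + 854 = 996 categories.
TOP_LEVEL, MID_LEVEL, LEAF = 16, 126, 854
TOTAL_CATEGORIES = TOP_LEVEL + MID_LEVEL + LEAF


@dataclass(frozen=True)
class TaxonomyNode:
    name: str
    parent: Optional[str]  # None marks a top-level label


# Hypothetical taxonomy fragment; these label names are invented.
NODES: Dict[str, TaxonomyNode] = {
    "violence": TaxonomyNode("violence", None),
    "weapons": TaxonomyNode("weapons", "violence"),
    "firearm modification": TaxonomyNode("firearm modification", "weapons"),
}


def path_to_root(label: str) -> List[str]:
    """Follow parent links from a label up to its top-level ancestor."""
    path: List[str] = []
    current: Optional[str] = label
    while current is not None:
        path.append(current)
        current = NODES[current].parent
    return path


def roll_up(scores: Dict[str, float], threshold: float = 0.5) -> dict:
    """Assumed rule: a label scoring at or above the threshold flags itself
    and every ancestor; the text is unsafe if anything is flagged."""
    flagged: Dict[str, float] = {}
    for label, score in scores.items():
        if score < threshold:
            continue
        for ancestor in path_to_root(label):
            flagged[ancestor] = max(flagged.get(ancestor, 0.0), score)
    return {"unsafe": bool(flagged), "labels": flagged}
```

Under this assumed rule, a single classifier pass over leaf labels would yield a fine-grained category, its coarser parents, and a binary verdict at once; a deployed system could equally score each level independently.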

Problem

Research questions and friction points this paper is trying to address.

safety classification

toxicity

jailbreaks

harmful content

real-time filtering

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-task safety classification

efficient guardrail models

GLiClass architecture