Trust The Typical

📅 2026-02-04
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing safety mechanisms in large language models, which rely on known threat detection and suffer from high false-positive rates and vulnerability to attacks. The authors propose modeling safety alignment as an out-of-distribution detection problem in semantic space, learning the typical distribution of benign prompts without requiring harmful examples. This approach enables generalizable, cross-domain, and multilingual protection. By integrating semantic space modeling, GPU-optimized inference, and real-time guardrailing within the vLLM framework, the method achieves state-of-the-art performance across 18 safety benchmarks, reduces false positives by up to 40×, supports over 14 languages, and incurs less than 6% inference overhead.

📝 Abstract
Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.
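The abstract's core idea, learning the distribution of benign prompts in a semantic embedding space and flagging significant deviations, can be illustrated with a minimal sketch. The Gaussian fit, Mahalanobis distance, and percentile threshold below are illustrative assumptions; the paper's actual density model and scoring rule are not specified here, and the random vectors stand in for real sentence embeddings.

```python
import numpy as np

# Illustrative sketch: model the "typical" benign distribution, then
# flag prompts whose embeddings deviate too far from it.
rng = np.random.default_rng(0)

# Stand-in for semantic embeddings of benign training prompts
# (a real system would use an actual sentence encoder).
benign = rng.normal(loc=0.0, scale=1.0, size=(5000, 32))

# Fit the benign distribution: mean and regularized covariance.
mu = benign.mean(axis=0)
cov = np.cov(benign, rowvar=False) + 1e-6 * np.eye(benign.shape[1])
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    """Distance of one embedding from the benign distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Calibrate the threshold on benign data only (here the 99th
# percentile), bounding the false-positive rate by construction --
# no harmful examples are needed at any point.
scores = np.array([mahalanobis(x) for x in benign])
threshold = np.percentile(scores, 99)

def is_out_of_distribution(embedding):
    return mahalanobis(embedding) > threshold

# An embedding far from the benign manifold is flagged as a threat.
typical = rng.normal(0.0, 1.0, size=32)
atypical = rng.normal(8.0, 1.0, size=32)  # shifted far from benign mean
print(is_out_of_distribution(atypical))
```

Because the detector is trained only on acceptable inputs, the same calibration argument explains the abstract's over-refusal claim: the false-positive rate on benign prompts is controlled directly by the chosen percentile.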
Problem

Research questions and friction points this paper is trying to address.

LLM safety
out-of-distribution detection
harmful content
guardrails
robust safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

out-of-distribution detection
LLM safety
semantic distribution modeling
zero-shot multilingual safety
efficient guardrailing