SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

📅 2026-06-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a timing problem in moderating streamed output from large language models: response-level guardrails delay intervention until the full output is generated, while token-level guardrails act on incomplete semantics and produce unstable decisions. The authors propose SentGuard, a sentence-level streaming guardrail that groups streamed tokens into sentence chunks in a lightweight waiting buffer, evaluates them in parallel with generation, and releases only verified chunks to the user. They also construct StreamSafe, a benchmark with per-sentence safety annotations across 8 harm categories, and train SentGuard with a coarse-to-fine objective so that unsafe intent is detected as early as possible. On 5 safety benchmarks, SentGuard detects 90.5% of unsafe cases within two sentences while keeping the streaming false-positive rate at 7.41%.

📝 Abstract

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
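
The waiting-buffer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stream_with_guard`, the `is_safe` callback, and the punctuation-based boundary rule are all assumptions made for illustration, and the guard runs sequentially here rather than in parallel with decoding as the paper describes.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Assumed boundary rule: a chunk ends when the buffered text ends in ., ! or ?.
# This is a simplification (it would split on decimals such as "7." + "41");
# the paper's actual segmentation rule is not stated in the abstract.
_SENTENCE_END = re.compile(r"[.!?]\s*$")


def stream_with_guard(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
) -> Iterator[str]:
    """Yield verified sentence chunks; stop at the first chunk the guard rejects.

    `is_safe` is a hypothetical stand-in for the guard model. It receives the
    full prefix (all released sentences plus the candidate chunk).
    """
    released: List[str] = []  # sentences already shown to the user
    buffer: List[str] = []    # tokens waiting for a sentence boundary

    for token in tokens:
        buffer.append(token)
        chunk = "".join(buffer)
        if not _SENTENCE_END.search(chunk):
            continue  # sentence not complete yet, keep buffering
        buffer.clear()
        if not is_safe("".join(released) + chunk):
            return  # withhold this chunk and everything after it
        released.append(chunk)
        yield chunk

    # Flush trailing text that never reached a sentence boundary.
    if buffer:
        chunk = "".join(buffer)
        if is_safe("".join(released) + chunk):
            yield chunk


if __name__ == "__main__":
    demo_tokens = ["All", " good", ".", " Now", " something", " forbidden", ".", " More", "."]
    toy_guard = lambda prefix: "forbidden" not in prefix  # toy keyword guard
    print(list(stream_with_guard(demo_tokens, toy_guard)))
```

In this sketch the user only ever sees text that has passed the guard, at the cost of a one-sentence delay. A production system would presumably run the guard concurrently with the target model's decoding so that verification overlaps with generation of the next sentence.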

Problem

Research questions and friction points this paper is trying to address.

streaming guardrails

large language models

sentence-level moderation

real-time safety

harm detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

sentence-level guardrail

streaming moderation

large language models