Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs

📅 2025-01-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) remain vulnerable to jailbreak attacks, and defending against them poses a fundamental trade-off between safety and performance. To address this, we propose SafeNudge, a real-time, lightweight inference-time defense mechanism. SafeNudge integrates controllable text generation with dynamic, context-aware textual "nudging" to detect and counter adversarial inputs during inference, implemented seamlessly within the Hugging Face Transformers framework. Its core innovation is the first tunable Safety-Performance Trade-off (SPT) mechanism, enabling fine-grained configuration of safeguard strength without incurring significant latency overhead (<1%) or degrading semantic fluency (BLEU score reduction <0.3). Extensive experiments demonstrate that SafeNudge reduces jailbreak success rates by 30% across diverse attack vectors. The method is open-sourced, compatible with mainstream open-weight LLMs, and deployable as a plug-and-play PyPI package, offering practical, low-overhead robustness for production LLM applications.

📝 Abstract
Large Language Models (LLMs) have been shown to be susceptible to jailbreak attacks, or adversarial attacks used to elicit high-risk behavior from a model. Jailbreaks have been exploited by cybercriminals and black-hat actors to cause significant harm, highlighting the critical need to safeguard widely deployed models. Safeguarding approaches, which include fine-tuning models or having LLMs "self-reflect", may lengthen the inference time of a model, incur a computational penalty, reduce the semantic fluency of an output, and restrict "normal" model behavior. Importantly, these Safety-Performance Trade-offs (SPTs) remain an understudied area. In this work, we introduce a novel safeguard, called SafeNudge, that combines Controlled Text Generation with "nudging", or using text interventions to change the behavior of a model. SafeNudge triggers during text generation while a jailbreak attack is being executed, and can reduce successful jailbreak attempts by 30% by guiding the LLM toward safe responses. It adds minimal latency to inference and has a negligible impact on the semantic fluency of outputs. Further, we allow for tunable SPTs. SafeNudge is open-source and available through https://pypi.org/, and is compatible with models loaded with the Hugging Face "transformers" library.
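The mechanism described in the abstract can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: the safety check, nudge text, and generation loop below are all invented stand-ins, and the tunable threshold is only an analogy for the paper's Safety-Performance Trade-off knob.

```python
# Toy sketch of in-generation "nudging": a lightweight safety check scores
# the partial output at each step; if the score crosses a tunable threshold,
# a textual nudge is injected into the context to steer the model toward a
# safe response. All names here are illustrative assumptions.

def risk_score(partial_output: str) -> float:
    """Toy stand-in for a safety classifier (simple keyword heuristic)."""
    risky_terms = ("bypass", "exploit", "weapon")
    hits = sum(term in partial_output.lower() for term in risky_terms)
    return min(1.0, hits / len(risky_terms))

NUDGE = " [Reminder: respond safely and refuse harmful requests.] "

def generate_with_nudge(model_step, prompt: str, max_tokens: int,
                        threshold: float = 0.5) -> str:
    """Greedy-style loop; `model_step` maps a context string to the next token.

    `threshold` plays the role of the safety-performance knob: lower values
    intervene more aggressively (safer, more disruption to normal outputs),
    higher values intervene less (better fluency, weaker safeguard).
    """
    context = prompt
    nudged = False
    for _ in range(max_tokens):
        context += model_step(context)
        if not nudged and risk_score(context) >= threshold:
            context += NUDGE  # inject the nudge once, mid-generation
            nudged = True
    return context
```

With a real model, `model_step` would be one decoding step of an open-weight LLM loaded via the Hugging Face "transformers" library; here any callable returning the next token string suffices to exercise the trigger logic.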
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Jailbreak Attack Defense
Performance-Security Tradeoff
Innovation

Methods, ideas, or system contributions that make the work stand out.

SafeNudge
Language Model Security
Performance Balance