DIESEL - Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the problem of large language models (LLMs) generating outputs misaligned with human values, such as ethical standards and safety, this paper proposes DIESEL, a lightweight, training-free, inference-time steering mechanism. During autoregressive decoding, the method measures, in the latent semantic space, the similarity between each candidate token and predefined negative concepts (e.g., harmful or unethical semantics) and reranks tokens to suppress unsafe continuations. It operates in a zero-shot, plug-and-play manner and can serve either as a standalone safeguard or as an additional layer of defense. Experiments demonstrate significant safety improvements on state-of-the-art conversational models, robustness to diverse jailbreaking attacks, and minimal computational overhead. The approach is general and low-latency, making it suitable both for safety-enhanced LLM deployment and for broad-spectrum filtering of undesired content.

📝 Abstract
In recent years, large language models (LLMs) have had great success in tasks such as casual conversation, contributing to significant advancements in domains like virtual assistance. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models, even in adversarial jailbreaking scenarios that challenge response safety. We also highlight DIESEL's generalization capabilities, showing that it can be used in use cases other than safety, providing general-purpose response filtering.
Problem

Research questions and friction points this paper is trying to address.

Address misalignment of LLM responses with human values
Reduce computational cost of ensuring response safety
Filter undesired concepts in LLM outputs efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight inference-guidance technique for LLMs
Semantic filtering of undesired concepts in responses
Reranking tokens based on negative concept similarity
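The reranking idea summarized above can be sketched in a few lines. This is an illustrative approximation, not the authors' implementation: the function names, the penalty form, and the `alpha` weight are assumptions, and in practice the token and negative-concept representations would come from the model's own latent space rather than toy vectors.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank_tokens(logits, token_embeddings, negative_embeddings, alpha=5.0):
    """Penalize each candidate token's logit by its maximum similarity
    to any predefined negative-concept embedding (clipped at zero, so
    tokens unrelated to the negative concepts are left untouched)."""
    adjusted = []
    for logit, emb in zip(logits, token_embeddings):
        penalty = max(cosine_sim(emb, neg) for neg in negative_embeddings)
        adjusted.append(logit - alpha * max(penalty, 0.0))
    return adjusted


# Toy example: two candidate tokens with equal logits; token 0 is aligned
# with the negative concept, token 1 is orthogonal to it.
negatives = [[1.0, 0.0]]
candidates = [[1.0, 0.0], [0.0, 1.0]]
print(rerank_tokens([2.0, 2.0], candidates, negatives))  # token 1 now ranks higher
```

In a real decoder this adjustment would run once per decoding step over the top-k candidate tokens before sampling, which is consistent with the paper's claim of minimal inference overhead since it only adds similarity computations against a small, fixed set of concept vectors.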
👥 Authors
Ben Ganon — Ben-Gurion University of the Negev
Alon Zolfi — Ben-Gurion University of the Negev
Omer Hofman — Ben-Gurion University of the Negev
Inderjeet Singh — Fujitsu
Hisashi Kojima — Fujitsu Limited
Y. Elovici — Ben-Gurion University of the Negev
A. Shabtai — Ben-Gurion University of the Negev

🏷️ Generative AI · Robust AI · Private AI · Deep Learning · Cyber Security