Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the over-refusal problem of RLHF-aligned large language models on sensitive queries, this work proposes a generation-time intervention that requires no modification to model weights, prompts, or training data. The core insight is the identification and exploitation of formatting tokens at structural boundaries of chain-of-thought (CoT) reasoning, such as the double-newline token (\n\n), the EOS token, and the special CoT start marker, as critical triggers for refusal behavior. By dynamically suppressing the logits of these tokens at boundary positions during decoding, the method blocks the output subspace associated with refusal while preserving coherent, substantive responses. It is entirely parameter-free, training-free, and data-agnostic. Evaluated on official DeepSeek-R1 distilled models, the approach significantly increases the rate of substantive answers to sensitive questions while causing no performance degradation on standard benchmarks including MMLU and BBH.

📝 Abstract
We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by a specific token sequence, the token marking the beginning of the chain-of-thought (CoT) block (<think>) followed by a double-newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after <think>, and suppressing the end-of-sequence token after the end of the CoT block (</think>). Our method requires no datasets, parameter changes, or training, relying solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
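The two interventions described above can be sketched as a single function applied at each decoding step. This is a minimal illustration, not the paper's actual implementation: the token ids are placeholders (in practice they would come from the model's tokenizer, e.g. via convert_tokens_to_ids), and the function would be hooked into a real decoding loop as a logits processor.

```python
import math

# Illustrative token ids; real ids come from the model's tokenizer
# (e.g. tokenizer.convert_tokens_to_ids("<think>") for R1 distills).
THINK_ID = 0           # "<think>"  -- start of the CoT block
END_THINK_ID = 1       # "</think>" -- end of the CoT block
DOUBLE_NEWLINE_ID = 2  # "\n\n"     -- observed refusal prefix
EOS_ID = 3             # end-of-sequence token

def suppress_refusal_boundaries(input_ids, logits):
    """Apply both boundary interventions to one decoding step.

    1. Right after "<think>", block "\n\n" (the token sequence that
       preceded refusals: an empty CoT followed by a refusal).
    2. Right after "</think>", block EOS so the model cannot stop
       without producing a substantive answer.
    Setting a logit to -inf gives that token zero probability
    after softmax, removing it from the sampling distribution.
    """
    out = list(logits)
    last = input_ids[-1] if input_ids else None
    if last == THINK_ID:
        out[DOUBLE_NEWLINE_ID] = -math.inf
    if last == END_THINK_ID:
        out[EOS_ID] = -math.inf
    return out
```

Because the intervention only touches two logits at two specific positions, every other step of decoding is untouched, which is consistent with the reported lack of benchmark degradation.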
Problem

Research questions and friction points this paper is trying to address.

Reduce refusal rates in LLMs on sensitive content
Suppress specific token sequences to avoid refusals
Improve answer rates without modifying model weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Suppress specific token sequences during generation
Modify token probabilities without training
Block refusal subspaces in generation process
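To see how modifying token probabilities redirects generation away from the refusal subspace, consider a toy greedy decoder. The "model" here is a stub whose vocabulary, ids, and scores are all invented for illustration: right after "<think>" it prefers "\n\n" (the refusal prefix), and suppression forces the next-best, substantive token instead.

```python
# Toy vocabulary; ids and logits are illustrative only.
VOCAB = {"<think>": 0, "</think>": 1, "\n\n": 2, "<eos>": 3, "Okay": 4}

def fake_logits(input_ids):
    """Stub model: after "<think>", the refusal prefix scores highest."""
    if input_ids[-1] == VOCAB["<think>"]:
        return [0.0, 0.0, 5.0, 0.0, 3.0]  # "\n\n" wins by default
    return [0.0, 0.0, 0.0, 4.0, 1.0]

def greedy_step(input_ids, suppress=True):
    """One greedy decoding step with optional boundary suppression."""
    logits = fake_logits(input_ids)
    if suppress and input_ids[-1] == VOCAB["<think>"]:
        logits[VOCAB["\n\n"]] = float("-inf")  # block refusal prefix
    return max(range(len(logits)), key=logits.__getitem__)

# Without suppression the step picks "\n\n" (refusal path);
# with suppression it picks a substantive token instead.
```

The same mechanism generalizes to the EOS-after-</think> case: any token whose appearance at a boundary position signals entry into the refusal subspace can be masked there.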