Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation of safety alignment in Finetuning-as-a-Service when users' private data contains harmful prompts, this paper proposes an unsupervised framework for harmful-data filtering and alignment knowledge distillation based on "refusal features." The authors formally define generalizable refusal features, directional representations that isolate a model's safety-alignment capability, and extract them automatically from safety-aligned LLMs; harmful prompts are then identified without annotations via feature-similarity modeling. The method adopts a two-stage finetuning paradigm: (1) data filtering to exclude harmful prompts, followed by (2) alignment-aware knowledge distillation. Evaluated across multiple safety benchmarks and domain-specific tasks, the approach reduces harmful output rates by 62% on average while improving downstream task accuracy by 2.4%, achieving a state-of-the-art trade-off between safety and performance.
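The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the refusal feature is a difference-in-means direction between hidden-state features of harmful and harmless prompts (a common way such directions are extracted), and that a prompt is flagged by its cosine similarity to that direction. The feature vectors here are synthetic stand-ins, and `threshold` is a hypothetical parameter.

```python
import numpy as np

def refusal_direction(harmful_feats, harmless_feats):
    """Unit-norm difference-in-means direction between harmful and
    harmless prompt features (one way to extract a 'refusal feature')."""
    d = harmful_feats.mean(axis=0) - harmless_feats.mean(axis=0)
    return d / np.linalg.norm(d)

def keep_mask(feats, direction, threshold=0.0):
    """Keep prompts whose cosine similarity to the refusal direction
    is below `threshold`, i.e. those scored as harmless."""
    sims = feats @ direction / np.linalg.norm(feats, axis=1)
    return sims < threshold

# Toy demo with synthetic 2-D features standing in for LLM activations.
rng = np.random.default_rng(0)
harmful = rng.normal(loc=[3.0, 0.0], scale=0.1, size=(20, 2))
harmless = rng.normal(loc=[-3.0, 0.0], scale=0.1, size=(20, 2))
d = refusal_direction(harmful, harmless)
keep = keep_mask(np.vstack([harmful, harmless]), d)
print(keep[:20].sum(), keep[20:].sum())  # harmful dropped, harmless kept
```

In the real setting the features would come from a safety-aligned LLM's residual stream on each prompt, and the filtered subset would then be used for finetuning.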

📝 Abstract
Recently, major AI service providers such as Google and OpenAI have introduced Finetuning-as-a-Service, which enables users to customize Large Language Models (LLMs) for specific downstream tasks using their own data. However, this service is vulnerable to degradation of LLM safety-alignment when user data contains harmful prompts. While some prior works address this issue, fundamentally filtering harmful data from user data remains unexplored. Motivated by our observation that a directional representation reflecting refusal behavior (called the refusal feature) obtained from safety-aligned LLMs can inherently distinguish between harmful and harmless prompts, we propose the Refusal-Feature-guided Teacher (ReFT). Our ReFT model is trained to identify harmful prompts based on the similarity between input prompt features and its refusal feature. During finetuning, the ReFT model serves as a teacher that filters harmful prompts from user data and distills alignment knowledge into the base model. Extensive experiments demonstrate that our ReFT-based finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in Finetuning-as-a-Service.
Problem

Research questions and friction points this paper is trying to address.

Filter harmful prompts from user data for safe LLM finetuning
Distill alignment knowledge to maintain LLM safety during finetuning
Prevent degradation of safety alignment in Finetuning-as-a-Service
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses refusal feature to filter harmful prompts
Trains teacher model for data filtering
Distills alignment knowledge into base model
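The distillation step in the last bullet can be sketched generically. The paper's exact objective is not given in this summary, so the following assumes the standard knowledge-distillation loss, a temperature-scaled KL divergence between teacher and student next-token distributions, as a stand-in for transferring the ReFT teacher's alignment behavior; `temperature` and the toy logits are illustrative assumptions.

```python
import numpy as np

def softmax(z, temperature=1.0):
    # Numerically stable softmax over the last axis.
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student): the standard distillation objective,
    here standing in for alignment-knowledge transfer."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

# A student close to the aligned teacher incurs a smaller loss than
# one whose preferences have drifted during finetuning.
teacher = np.array([[2.0, 0.5, -1.0]])
aligned = np.array([[1.9, 0.6, -1.1]])
drifted = np.array([[-1.0, 0.5, 2.0]])
print(distill_loss(aligned, teacher) < distill_loss(drifted, teacher))  # True
```

During finetuning, a term like this would be added to the task loss so the base model tracks the teacher's refusal behavior while learning the user's task.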