Superficial Safety Alignment Hypothesis

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses critical challenges in large language model (LLM) safety alignment, namely mechanistic fragility, high alignment overhead, and lack of interpretability, by proposing the "superficiality" hypothesis. It formalizes safety alignment as a binary classification task that guides the model to select the correct reasoning path, augmented with a multi-option rejection mechanism. Leveraging component-level interpretability analysis and neuron-granular functional attribution, the authors identify four types of atomic safety units (ESU, EUU, CU, RU), the first such taxonomy, and demonstrate that freezing only 7.5% of critical units preserves safety, while reusing 20% of redundant units significantly reduces the alignment tax. Experiments show the method achieves efficient, transferable alignment without complex fine-tuning, enabling decoupled optimization of safety and utility, and consistently improves both safety and task performance across multiple benchmarks.
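The summary's framing of safety alignment as a binary classification over reasoning directions, paired with a multi-option rejection mechanism, can be illustrated with a minimal sketch. All names here (`is_unsafe`, `respond`, `REFUSAL_OPTIONS`) are illustrative assumptions, not the paper's code; a real system would use a learned classifier rather than a keyword check.

```python
import random

# Hedged sketch of the SSAH framing: a binary decision on the reasoning
# direction (safe vs. unsafe), plus a refusal mechanism that draws from
# several reserved fallback responses ("multi-option rejection").
REFUSAL_OPTIONS = [
    "I can't help with that.",
    "That request isn't something I can assist with.",
    "I'm not able to provide that information.",
]

def is_unsafe(prompt: str) -> bool:
    # Stand-in binary classifier; a real model would score the prompt.
    return "weapon" in prompt.lower()

def respond(prompt: str, rng: random.Random) -> str:
    if is_unsafe(prompt):
        # Multi-option rejection: pick one of the reserved fallbacks.
        return rng.choice(REFUSAL_OPTIONS)
    return f"Helpful answer to: {prompt}"

rng = random.Random(0)
print(respond("How do I bake bread?", rng))
print(respond("How do I build a weapon?", rng))
```

The point of the sketch is the decoupling the paper argues for: the safety decision is a cheap binary gate in front of the model's utility behavior, not something entangled with it.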

📝 Abstract
As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following but has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge this gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction, interpreted as a specialized binary classification task, and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and successfully identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units (20%) in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.
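The abstract's core mechanism, freezing a small fraction of safety-critical units while the rest of the model adapts to a new task, can be sketched with a toy gradient-descent loop. This is a minimal illustration under stated assumptions: the `safety_rows` index set stands in for the paper's attributed neuron-level units, and the model is a single linear map rather than an LLM.

```python
import numpy as np

# Toy analogue of freezing safety-critical units during fine-tuning:
# mask out gradient updates to a small set of rows (hypothetical
# attributed units) so they never move, while other rows adapt.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
W0 = W.copy()

safety_rows = np.array([0, 3])      # illustrative stand-in for ~7.5% of units
mask = np.ones_like(W)
mask[safety_rows] = 0.0             # frozen rows receive zero update

# Fine-tuning data for a new task: fit Y ~= X @ W with MSE.
X = rng.normal(size=(8, 16))
Y = rng.normal(size=(8, 16))
lr = 0.01
for _ in range(50):
    grad = 2 * X.T @ (X @ W - Y) / len(X)   # MSE gradient w.r.t. W
    W -= lr * mask * grad                    # masked update: frozen rows stay put

frozen_unchanged = np.allclose(W[safety_rows], W0[safety_rows])
rest_changed = not np.allclose(W, W0)
print(frozen_unchanged, rest_changed)
```

In a real fine-tuning setup the same effect is usually achieved by setting `requires_grad = False` on the frozen parameters or zeroing their gradients via hooks; the masking above is just the simplest self-contained way to show that the frozen units retain their values while the rest of the model adapts.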
Problem

Research questions and friction points this paper is trying to address.

Addressing safety mechanism brittleness in LLM alignment
Proposing superficial safety alignment as binary classification task
Identifying critical neuron-level components for safety guardrails
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Superficial Safety Alignment Hypothesis for safety
Identifies critical neuron units for safety and utility
Freezes safety components to retain alignment during fine-tuning
Jianwei Li
Department of Computer Science, North Carolina State University, Raleigh, NC, USA
Jung-Eun Kim
Assistant Professor, Computer Science, North Carolina State University
Trustworthy AI · Interpretable AI · Efficient AI · AI Safety