🤖 AI Summary
Deploying safe large language models (LLMs) on resource-constrained edge devices is hindered by substantial memory and computational overhead, which makes dual-model defense schemes (an LLM paired with a separate guard model) especially impractical. This work presents a systematic validation of soft prompts for safety alignment and introduces a framework that combines soft prompts with knowledge distillation, using total variation and KL divergence objectives, to transfer safety behaviors from a guard model into the primary model. Evaluated across multiple LLM architectures and parameter-efficient fine-tuning approaches, the method balances high safety against low resource consumption. Experiments on multiple benchmarks show that it outperforms alternatives such as LoRA adapters and steering vectors, achieving better safety-utility trade-offs with minimal additional memory and compute at inference time.
📝 Abstract
Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.
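For readers unfamiliar with the two divergences named in the abstract, the following is a minimal sketch of how each measures the gap between a target token distribution and the distribution produced by a model. It uses the textbook definitions only. The abstract does not specify the paper's exact training objective, so the logit values, the names `teacher` and `student`, and the direction of the KL term are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def total_variation(p, q):
    """TV(p, q) = 0.5 * sum_i |p_i - q_i|, always in [0, 1]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical next-token logits over a 4-token vocabulary (made-up values).
teacher_logits = [4.0, 0.5, 0.1, -1.0]  # assumed target (safe) behavior
student_logits = [1.0, 2.5, 0.3, -0.5]  # assumed model output with a soft prompt

teacher = softmax(teacher_logits)
student = softmax(student_logits)

kl = kl_divergence(teacher, student)
tv = total_variation(teacher, student)
print(f"KL(teacher || student) = {kl:.4f} nats")
print(f"TV(teacher, student)   = {tv:.4f}")
```

In a distillation setup of the kind the abstract describes, a quantity like `kl` or `tv` would presumably serve as the loss, with only the soft prompt embeddings updated while the base model stays frozen. Total variation is bounded, whereas KL divergence can grow without limit when the student assigns near-zero probability to a token the target favors, which is one plausible reason to study both.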