🤖 AI Summary
This study systematically investigates how parameter-efficient fine-tuning (PEFT) affects the safety and fairness of large language models (LLMs), and the inherent trade-offs between these two dimensions. Across 235 fine-tuned variants, we comparatively evaluate four PEFT paradigms (LoRA, IA³, Prompt-Tuning, and P-Tuning) applied to four instruction-tuned LLM families, using multi-dimensional safety and fairness benchmarks. Results show that adapter-based methods (LoRA, IA³) consistently improve safety and mitigate group-level biases, whereas prompt-based methods (Prompt-Tuning, P-Tuning) generally degrade safety and severely compromise fairness. Crucially, enhanced safety does not entail improved fairness; instead, a fundamental trade-off exists between these two ethical dimensions. Moreover, the choice of base model significantly affects post-fine-tuning stability. This work is the first to empirically establish a strong association between PEFT methodology and downstream ethical attributes, providing evidence-based guidance and methodological insights for developing safe, fair, and lightweight fine-tuning strategies.
📝 Abstract
Organizations are increasingly adopting and adapting Large Language Models (LLMs) hosted on public repositories such as HuggingFace. Although these adaptations often improve performance on specialized downstream tasks, recent evidence indicates that they can also degrade a model's safety or fairness. Since different fine-tuning techniques may exert distinct effects on these critical dimensions, this study undertakes a systematic assessment of their trade-offs. Four widely used Parameter-Efficient Fine-Tuning (PEFT) methods (LoRA, IA³, Prompt-Tuning, and P-Tuning) are applied to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In total, 235 fine-tuned variants are evaluated across eleven safety hazard categories and nine demographic fairness dimensions. The results show that adapter-based approaches (LoRA, IA³) tend to improve safety scores and are the least disruptive to fairness, retaining higher accuracy and lower bias scores. In contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce safety and cause larger fairness regressions, with decreased accuracy and increased bias. Alignment shifts are strongly moderated by the base model: Llama remains stable, Qwen records modest gains, Gemma experiences the steepest safety decline, and Mistral, which is released without an internal moderation layer, displays the greatest variance. Improvements in safety do not necessarily translate into improvements in fairness, and no single configuration optimizes all fairness metrics simultaneously, indicating an inherent trade-off between these objectives. These findings suggest a practical guideline for safety-critical deployments: begin with a well-aligned base model, favour adapter-based PEFT, and conduct category-specific audits of both safety and fairness.
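To make the adapter-based recommendation concrete, here is a minimal sketch contrasting the two PEFT families with the Hugging Face `peft` library. The checkpoint name and hyperparameter values (rank, alpha, number of virtual tokens) are illustrative assumptions, not the configuration used in the study.

```python
# Minimal sketch of adapter-based vs. prompt-based PEFT using Hugging Face
# `peft`. Checkpoint and hyperparameters are assumed for illustration; they
# are not the study's exact setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PromptTuningConfig, TaskType, get_peft_model

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint

# Adapter-based PEFT (LoRA): trains low-rank updates injected into the
# attention projections while the base weights stay frozen.
lora_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL),
    LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                                # rank of the update (assumed)
        lora_alpha=32,                       # scaling factor (assumed)
        target_modules=["q_proj", "v_proj"],
    ),
)
lora_model.print_trainable_parameters()

# Prompt-based PEFT (Prompt-Tuning): trains only a sequence of soft tokens
# prepended to every input; the base model itself is untouched.
pt_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL),
    PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=20,               # soft-prompt length (assumed)
    ),
)
pt_model.print_trainable_parameters()
```

Either wrapped model can then be trained with a standard `transformers` Trainer; the study's finding is that, other things being equal, the LoRA-style configuration is less likely to erode the base model's safety and fairness alignment than the soft-prompt one.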