🤖 AI Summary
This study systematically investigates how parameter-efficient fine-tuning (PEFT) affects the safety and fairness of large language models (LLMs), and the inherent trade-offs between these two dimensions. Across 235 fine-tuned variants, we comparatively evaluate four PEFT paradigms (LoRA, IA³, Prompt-Tuning, and P-Tuning) applied to four instruction-tuned LLM families, using multi-dimensional safety and fairness benchmarks. Results show that adapter-based methods (LoRA, IA³) consistently improve safety and mitigate group-level biases, whereas prompt-based methods (Prompt-Tuning, P-Tuning) generally degrade safety and severely compromise fairness. Crucially, enhanced safety does not entail improved fairness; instead, a fundamental trade-off exists between these two ethical dimensions. Moreover, the choice of base model significantly affects post-fine-tuning stability. This work is the first to empirically establish a strong association between PEFT methodology and downstream ethical attributes, providing evidence-based guidance and methodological insights for developing safe, fair, and lightweight fine-tuning strategies.
📝 Abstract
Organizations are increasingly adopting and adapting Large Language Models (LLMs) hosted on public repositories such as HuggingFace. Although these adaptations often improve performance on specialized downstream tasks, recent evidence indicates that they can also degrade a model's safety or fairness. Since different fine-tuning techniques may exert distinct effects on these critical dimensions, this study undertakes a systematic assessment of their trade-offs. Four widely used Parameter-Efficient Fine-Tuning (PEFT) methods (LoRA, IA³, Prompt-Tuning, and P-Tuning) are applied to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In total, 235 fine-tuned variants are evaluated across eleven safety hazard categories and nine demographic fairness dimensions. The results show that adapter-based approaches (LoRA, IA³) tend to improve safety scores and are the least disruptive to fairness, retaining higher accuracy and lower bias scores. In contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce safety and cause larger fairness regressions, with decreased accuracy and increased bias. Alignment shifts are strongly moderated by the base model: Llama remains stable, Qwen records modest gains, Gemma experiences the steepest safety decline, and Mistral, which is released without an internal moderation layer, displays the greatest variance. Improvements in safety do not necessarily translate into improvements in fairness, and no single configuration optimizes all fairness metrics simultaneously, indicating an inherent trade-off between these objectives. These findings suggest a practical guideline for safety-critical deployments: begin with a well-aligned base model, favour adapter-based PEFT, and conduct category-specific audits of both safety and fairness.
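To make the adapter-based recommendation concrete, here is a minimal sketch contrasting the two PEFT families with the Hugging Face `peft` library. The checkpoint name and hyperparameter values (rank, alpha, number of virtual tokens) are illustrative assumptions, not the configuration used in the study.

```python
# Minimal sketch of adapter-based vs. prompt-based PEFT using Hugging Face
# `peft`. Checkpoint and hyperparameters are assumed for illustration; they
# are not the study's exact setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PromptTuningConfig, TaskType, get_peft_model

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint

# Adapter-based PEFT (LoRA): trains low-rank updates injected into the
# attention projections while the base weights stay frozen.
lora_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL),
    LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=16,                                # rank of the update (assumed)
        lora_alpha=32,                       # scaling factor (assumed)
        target_modules=["q_proj", "v_proj"],
    ),
)
lora_model.print_trainable_parameters()

# Prompt-based PEFT (Prompt-Tuning): trains only a sequence of soft tokens
# prepended to every input; the base model itself is untouched.
pt_model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(MODEL),
    PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=20,               # soft-prompt length (assumed)
    ),
)
pt_model.print_trainable_parameters()
```

Either wrapped model can then be trained with a standard `transformers` Trainer; the study's finding is that, other things being equal, the LoRA-style configuration is less likely to erode the base model's safety and fairness alignment than the soft-prompt one.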