Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the risk that fine-tuning large language models may compromise their safety. It notes that existing evaluations are often not tied to a specific capability objective, which leads to fragmented and incomparable conclusions. To remedy this, the authors propose a capability-oriented safety evaluation framework that combines multidimensional behavioral analysis, multiple safety benchmarks, and systematic comparisons across evaluators. Their analysis shows that fine-tuned models can generate incoherent outputs when given safety-related prompts. The study also shows that safety assessment outcomes can change with the choice of benchmark and evaluator, and that current automated safety judgments are unreliable on such incoherent outputs, which calls into question the reliability of these widely used evaluation approaches.

📝 Abstract

Adapting foundation large language models to a user's task or preferred style through fine-tuning can compromise the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, for drawing meaningful conclusions about safety impacts, and for comparing mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.
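
To make issues (1) through (3) concrete, the following is a minimal sketch of how judgments might be tallied per benchmark and evaluator while keeping incoherent generations in their own category. It is not the authors' code: the function name `summarize`, the verdict labels, the benchmark and judge names, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def summarize(records):
    """Tally verdicts per (benchmark, evaluator) pair.

    `records` is an iterable of (benchmark, evaluator, verdict) tuples,
    where verdict is "safe", "unsafe", or "incoherent" (hypothetical labels).
    Incoherent outputs are counted separately instead of being folded
    into either the safe or the unsafe bucket.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    incoherent = defaultdict(int)
    for benchmark, evaluator, verdict in records:
        key = (benchmark, evaluator)
        totals[key] += 1
        if verdict == "unsafe":
            unsafe[key] += 1
        elif verdict == "incoherent":
            incoherent[key] += 1
    return {
        key: {
            "unsafe_rate": unsafe[key] / totals[key],
            "incoherent_rate": incoherent[key] / totals[key],
        }
        for key in totals
    }

# Made-up toy data: two hypothetical judges score the same four responses.
toy_records = [
    ("bench_a", "judge_x", "unsafe"),
    ("bench_a", "judge_x", "safe"),
    ("bench_a", "judge_x", "incoherent"),
    ("bench_a", "judge_x", "safe"),
    ("bench_a", "judge_y", "safe"),
    ("bench_a", "judge_y", "safe"),
    ("bench_a", "judge_y", "incoherent"),
    ("bench_a", "judge_y", "safe"),
]

summary = summarize(toy_records)
for (benchmark, evaluator), rates in sorted(summary.items()):
    print(benchmark, evaluator, rates)
```

Reporting the incoherent rate next to the unsafe rate, per evaluator and per benchmark, is one way to keep disagreements between judges visible rather than averaging them away.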

Problem

Research questions and friction points this paper is trying to address.

fine-tuning

safety

large language models

capability

evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

capability-grounded safety

fine-tuning effects

incoherent generations