One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work reveals that jailbreaking vulnerabilities in pretrained large language models (LLMs) can be systematically inherited by their fine-tuned variants, posing a practical security threat. Under a “pretraining white-box, fine-tuning black-box” threat model, we find that adversarial prompts optimized against pretrained models exhibit strong cross-model transferability. For the first time, representation probing empirically confirms that the linear separability of jailbreaking prompts originates from hidden states learned during pretraining. Leveraging this insight, we propose Probe-Guided Projection (PGP), an attack framework that uses learned probes to steer adversarial optimization toward high-transferability directions. Experiments across multiple LLM families—including Llama, Qwen, and Phi—and diverse fine-tuning paradigms—such as instruction tuning, dialogue alignment, and domain adaptation—demonstrate that PGP significantly improves jailbreak transfer success rates. Our findings provide empirical evidence of a structural security inheritance risk inherent in the pretraining → fine-tuning pipeline.

📝 Abstract
Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
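The representation-level probing described in the abstract can be illustrated with a minimal, self-contained sketch: fit a linear probe (logistic regression) on hidden-state vectors and check whether two classes of prompts are linearly separable. Everything here is a hypothetical stand-in — synthetic Gaussian "hidden states" instead of real LLM activations, and an arbitrary dimension — not the paper's actual data or setup.

```python
import numpy as np

# Hypothetical stand-in for pretrained hidden states: two Gaussian clusters
# shifted along a fixed direction, mimicking "transferable" vs.
# "non-transferable" prompts being linearly separable in hidden space.
rng = np.random.default_rng(0)
d = 64                                   # hidden-state dimension (illustrative)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

n = 200
pos = rng.normal(size=(n, d)) + 3.0 * direction   # "transferable" prompts
neg = rng.normal(size=(n, d)) - 3.0 * direction   # "non-transferable" prompts
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Linear probe: logistic regression trained by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= 0.5 * (X.T @ (p - y)) / len(y)       # gradient step on weights
    b -= 0.5 * np.mean(p - y)                 # gradient step on bias

acc = np.mean(((X @ w + b) > 0) == (y == 1))
print(f"probe accuracy: {acc:.2f}")  # high when the classes are separable
```

A probe accuracy well above chance on held-out hidden states is the kind of evidence the paper uses to argue that transferability is already encoded in pretrained representations.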
Problem

Research questions and friction points this paper is trying to address.

Investigates inherited jailbreak vulnerabilities from pretrained to finetuned LLMs
Examines transferability of adversarial prompts via representation-level probing
Proposes Probe-Guided Projection attack to exploit these security risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial prompts optimized on a pretrained model transfer effectively to its finetuned variants
Transferable prompts are linearly separable in pretrained hidden states
Probe-Guided Projection (PGP) attack steers adversarial optimization toward transferability-relevant directions
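The projection idea behind the last bullet can be sketched abstractly: given a linear probe whose weight vector identifies a transferability-relevant direction in hidden space, keep only the component of each optimization update that lies along that direction. This is an illustrative reconstruction of the general vector-projection step, not the paper's actual PGP algorithm; `probe_w` and `grad` are hypothetical names.

```python
import numpy as np

def project_onto_probe(grad: np.ndarray, probe_w: np.ndarray) -> np.ndarray:
    """Keep only the component of `grad` along the probe direction.

    `probe_w` is assumed to be the weight vector of a linear probe that
    separates transferable from non-transferable prompts in hidden space.
    """
    u = probe_w / np.linalg.norm(probe_w)   # unit probe direction
    return (grad @ u) * u                    # scalar projection times direction

# Toy usage: the projected update lies entirely along the probe direction.
rng = np.random.default_rng(1)
probe_w = rng.normal(size=8)
grad = rng.normal(size=8)
g_proj = project_onto_probe(grad, probe_w)

u = probe_w / np.linalg.norm(probe_w)
print(np.allclose(g_proj, (g_proj @ u) * u))  # → True: g_proj is parallel to u
```

Constraining updates this way would, in principle, bias the adversarial search toward directions the probe associates with cross-model transferability rather than model-specific quirks.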