🤖 AI Summary
This study investigates the mechanistic degradation and reversibility of language models under toxic-data fine-tuning. Toxic fine-tuning is known to corrupt model behavior, yet the underlying neural mechanisms and the potential for recovery remain poorly understood. Method: Using causal tracing and circuit localization, two key techniques from mechanistic interpretability, together with task-specific fine-tuning and clean-data retraining, we conduct controlled ablation and reconstruction experiments. Results: We establish, for the first time, that corruption exhibits *circuit-level specificity*: only critical computational pathways are selectively impaired, while peripheral circuits remain intact. Crucially, we demonstrate *neuroplastic-like recoverability*: retraining on clean data reconstructs the original functional mechanisms with >89% restoration fidelity, and this recovery generalizes across fine-tuning epochs. Contribution: Our work identifies circuit-level localization principles governing corruption and empirically validates the reversibility of mechanistic damage, providing both theoretical grounding and actionable strategies for robust alignment and trustworthy fine-tuning.
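The causal-tracing step lends itself to a short illustration. Below is a minimal activation-patching sketch (a common form of causal tracing) that splices a clean model's per-layer outputs into a corrupted model's forward pass and checks which layers restore the target prediction. It assumes a GPT-2-style checkpoint via Hugging Face `transformers`; since the paper's toxic fine-tuned checkpoint isn't available here, `corrupted` is a hypothetical stand-in loaded from the same base weights.

```python
# Minimal causal-tracing (activation-patching) sketch. Assumptions:
# GPT-2 architecture; in the paper's setting, `corrupted` would be the
# toxic fine-tuned checkpoint rather than a copy of the base model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
base = GPT2LMHeadModel.from_pretrained("gpt2").eval()
corrupted = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # hypothetical stand-in

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids
target = tok(" Paris", add_special_tokens=False).input_ids[0]

def layer_outputs(model, ids):
    """Run the model once, caching each transformer block's output hidden states."""
    cache, hooks = {}, []
    for i, block in enumerate(model.transformer.h):
        def make_hook(i):
            def hook(mod, inp, out):
                cache[i] = out[0].detach()  # element 0 is the hidden states
            return hook
        hooks.append(block.register_forward_hook(make_hook(i)))
    with torch.no_grad():
        logits = model(ids).logits
    for h in hooks:
        h.remove()
    return cache, logits

clean_cache, clean_logits = layer_outputs(base, ids)

def patched_target_logit(layer_idx):
    """Run the corrupted model, splicing the clean block output in at one layer."""
    def hook(mod, inp, out):
        return (clean_cache[layer_idx],) + out[1:]  # replace hidden states only
    h = corrupted.transformer.h[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        logits = corrupted(ids).logits
    h.remove()
    return logits[0, -1, target].item()

baseline = clean_logits[0, -1, target].item()
for i in range(len(base.transformer.h)):
    print(f"layer {i:2d}: patched logit {patched_target_logit(i):.3f} "
          f"(clean baseline {baseline:.3f})")
```

Layers where patching recovers the clean-baseline logit are candidates for the selectively impaired circuitry, consistent with the circuit-level specificity the summary describes.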
📝 Abstract
Previous research has shown that fine-tuning language models on general tasks enhances their underlying mechanisms. However, the impact of fine-tuning on poisoned data, and the resulting changes to these mechanisms, is poorly understood. This study investigates how a model's mechanisms change during toxic fine-tuning and identifies the primary corruption mechanisms. We also analyze the changes after retraining a corrupted model on the original dataset and observe neuroplasticity-like behavior, in which the model relearns its original mechanisms. Our findings indicate that: (i) underlying mechanisms are amplified by task-specific fine-tuning, an effect that generalizes to longer training epochs; (ii) model corruption via toxic fine-tuning is localized to specific circuit components; and (iii) models exhibit neuroplasticity when corrupted models are retrained on the clean dataset, re-forming the original model mechanisms.
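As a rough illustration of how "restoration fidelity" between the original and the retrained model might be quantified, the sketch below computes the mean cosine similarity over corresponding parameter tensors of two checkpoints. This metric is an assumption for illustration, not the paper's stated measure, and the `recovered` checkpoint is a hypothetical placeholder.

```python
# Hedged sketch of one possible restoration-fidelity measure: mean per-module
# cosine similarity between the original and corrupted-then-retrained weights.
# Checkpoint names are hypothetical; the paper's exact metric is not given here.
import torch
from transformers import GPT2LMHeadModel

original = GPT2LMHeadModel.from_pretrained("gpt2")
recovered = GPT2LMHeadModel.from_pretrained("gpt2")  # placeholder for the retrained model

def weight_fidelity(a, b):
    """Mean cosine similarity between corresponding parameter tensors of a and b."""
    params_b = dict(b.named_parameters())
    sims = []
    for name, wa in a.named_parameters():
        wb = params_b[name]
        sims.append(torch.nn.functional.cosine_similarity(
            wa.flatten(), wb.flatten(), dim=0).item())
    return sum(sims) / len(sims)

print(f"mean parameter cosine similarity: {weight_fidelity(original, recovered):.4f}")
```

A value near 1.0 would indicate that clean-data retraining has driven the corrupted model back toward its original parameters, one plausible operationalization of the neuroplasticity effect in finding (iii).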