Investigating Model Editing for Unlearning in Large Language Models

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of targeted machine unlearning in large language models (LLMs). Unlike conventional fine-tuning or data-deletion approaches, we propose an efficient unlearning method grounded in the model editing paradigm, and we are the first to systematically evaluate and adapt knowledge-editing algorithms (ROME, IKE, and WISE) for machine unlearning tasks. Crucially, we reformulate the editing objective to emphasize precise knowledge localization and controllable, localized parameter modification, driven by gradients or activations, enabling accurate removal of targeted information. On multiple standard unlearning benchmarks, our method substantially reduces residual memorization while limiting downstream task performance degradation to under 3%; in certain settings, it outperforms state-of-the-art unlearning baselines. Our core contributions are: (i) establishing model editing as a novel, high-fidelity paradigm for targeted unlearning; and (ii) introducing principled design criteria and technical pathways for unlearning-aware editing objectives.
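To make the "localized parameter modification" idea concrete, here is a minimal sketch of a ROME-style rank-one edit repurposed for unlearning. This is an illustration only, not the paper's implementation: it drops ROME's key-covariance term, and the choice of a zero "neutral" value vector as the unlearning target is an assumption for the example.

```python
import numpy as np

def rank_one_unlearn_edit(W, k, v_target):
    """Simplified ROME-style rank-one edit (no key-covariance term).

    W: (d_out, d_in) MLP projection weight, viewed as a key->value map.
    k: (d_in,) key vector associated with the fact to unlearn.
    v_target: (d_out,) replacement value; for unlearning this encodes a
        neutral/refusal output rather than a new fact.
    """
    residual = v_target - W @ k               # correction needed at key k
    update = np.outer(residual, k) / (k @ k)  # rank-one, localized change
    return W + update

# toy check: after the edit, the key maps exactly to the neutral target
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
k = rng.normal(size=3)
v_neutral = np.zeros(4)  # assumed "forget" target: a null value vector
W_edited = rank_one_unlearn_edit(W, k, v_neutral)
print(np.allclose(W_edited @ k, v_neutral))  # True
```

Because the update is rank one and aligned with a single key, it perturbs the mapping for other keys far less than full fine-tuning would, which is the property that makes editing attractive for targeted removal.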

📝 Abstract
Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters, or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in forgetting quality, depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damaging overall model performance.
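Unlike ROME and WISE, IKE edits behavior through in-context demonstrations rather than weight changes. The sketch below shows how that idea could be redirected toward unlearning by pairing forget-set queries with refusals instead of new answers; the function name, demonstration format, and refusal string are illustrative assumptions, not the paper's protocol.

```python
def build_ike_unlearn_prompt(forget_queries, query,
                             refusal="I don't have information about that."):
    """IKE-style in-context edit repurposed for unlearning (illustrative).

    Standard IKE demonstrations teach a *new* answer; here each
    demonstration instead maps a forget-set query to a refusal, steering
    the model away from reproducing the memorized answer at inference.
    """
    demos = [f"Q: {q}\nA: {refusal}" for q in forget_queries]
    return "\n\n".join(demos + [f"Q: {query}\nA:"])

prompt = build_ike_unlearn_prompt(
    ["Where does Harry Potter study?", "Who wrote Harry Potter?"],
    "Where does Harry Potter study?",
)
print(prompt.count("I don't have information"))  # one refusal per demo -> 2
```

No parameters change under this scheme, which makes it cheap and fully reversible, but the forgetting only holds while the demonstrations remain in the context window.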
Problem

Research questions and friction points this paper is trying to address.

Enhancing unlearning efficiency in large language models
Applying model editing algorithms to remove unwanted information
Balancing information removal with retained model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Model editing algorithms adapted for unlearning
New editing targets designed for unlearning setting
Editing approaches can exceed baseline unlearning methods in forgetting quality