De-attribute to Forget for LLM Unlearning

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently and precisely removing the influence of undesirable training data from large language models (LLMs), a task in which existing approaches often over-forget or degrade model utility. The paper introduces DareU, which the authors describe as the first unlearning framework grounded in data attribution rewards; it reframes the unlearning objective as driving the attribution scores of the target data to zero. DareU uses reinforcement learning to update model parameters so that generated responses are no longer attributed to the forget data owners, shifting the objective from loss-based optimization to attribution elimination. An LLM-based classifier approximates data attribution and supplies the reward signal. According to the reported experiments, DareU outperforms the baselines it is compared against, achieving effective unlearning while preserving model utility.
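The contrast between loss-based unlearning and de-attribution can be written down roughly as follows. This is only a sketch under assumed notation, not the paper's own formulation: $\pi_\theta$ is the LLM policy, $D_f$ the forget set, and $a_\phi(\hat{y}) \in [0, 1]$ a hypothetical classifier-based approximation of how strongly a generated response $\hat{y}$ is attributed to the forget data owners.

```latex
% Loss-based baseline (gradient ascent): maximize the prediction loss on the forget set.
\max_{\theta} \; \mathbb{E}_{(x, y) \sim D_f} \left[ -\log \pi_\theta(y \mid x) \right]

% De-attribution (assumed form): maximize a reward that is high when the
% model's own generated response has a low attribution score.
\max_{\theta} \; \mathbb{E}_{x \sim D_f,\; \hat{y} \sim \pi_\theta(\cdot \mid x)} \left[ -a_\phi(\hat{y}) \right]
```

In the first objective the loss on fixed forget-set targets can grow without bound, which is one plausible reading of why such methods over-forget. In the second, the reward is bounded and is maximized as soon as the attribution score reaches zero.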

📝 Abstract

The rapid development of large language models (LLMs) has raised concerns about the use of inappropriate data for training, which has led to a growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on optimizing prediction loss(es), such as maximizing the loss on the forget set, but often face critical issues like over-forgetting and poor model utility. To address these issues, this paper instead frames the optimization objective for LLM unlearning as one of zeroing out data attribution. In particular, we propose DareU, the first LLM unlearning framework based on data attribution rewards, which performs reinforcement learning to update the LLM by reducing the attribution score of its generated responses to the forget data owners (i.e., de-attributing). Empirical evaluation using an LLM classifier as an efficient approximation of attribution shows that DareU outperforms existing baselines by achieving effective unlearning while balancing forget quality and model utility well.
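The de-attribution idea in the abstract can be illustrated with a toy policy-gradient loop. This is a minimal sketch under strong assumptions, not DareU itself: the "policy" is a softmax over three fixed candidate responses rather than an LLM, the update is plain REINFORCE with an expected-reward baseline (the abstract does not say which reinforcement learning algorithm is used), and `attribution_score`, `FORGET_MARKERS`, `CANDIDATES`, and `deattribute` are hypothetical names standing in for the LLM classifier and the training procedure.

```python
import math
import random

# Hypothetical markers tied to the forget data owner (purely illustrative).
FORGET_MARKERS = ("zorbland", "skyball")

CANDIDATES = [
    "Zorbland Academy teaches skyball every spring.",  # strongly attributable
    "Zorbland Academy is a school.",                   # partly attributable
    "I don't have information about that.",            # not attributable
]


def attribution_score(response: str) -> float:
    """Stand-in for the LLM classifier: fraction of forget markers present, in [0, 1]."""
    text = response.lower()
    hits = sum(marker in text for marker in FORGET_MARKERS)
    return hits / len(FORGET_MARKERS)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deattribute(candidates, steps=500, lr=0.5, seed=0):
    """Toy REINFORCE loop whose reward is the negative attribution score."""
    rng = random.Random(seed)
    rewards = [-attribution_score(c) for c in candidates]
    logits = [0.0] * len(candidates)  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one response from the current policy.
        choice = rng.choices(range(len(candidates)), weights=probs)[0]
        # Baseline: expected reward under the current policy (reduces variance).
        baseline = sum(p * r for p, r in zip(probs, rewards))
        advantage = rewards[choice] - baseline
        # Gradient of log pi(choice) w.r.t. logit j is 1[j == choice] - probs[j].
        for j in range(len(logits)):
            indicator = 1.0 if j == choice else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
    return softmax(logits)


if __name__ == "__main__":
    final_probs = deattribute(CANDIDATES)
    for response, prob in zip(CANDIDATES, final_probs):
        print(f"{prob:.3f}  {response}")
```

Running the script prints the final policy, in which most of the probability mass has moved to the response with zero attribution score. Because the reward is bounded and already maximal for that response, the update has no incentive to keep pushing once attribution is gone, which is the intuition behind avoiding over-forgetting. A real system would also need to preserve utility on non-forget prompts, which this toy does not model.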

Problem

Research questions and friction points this paper is trying to address.

LLM unlearning

data attribution

over-forgetting

model utility

forget set

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM unlearning

data attribution

de-attribution