REMIND: Input Loss Landscapes Reveal Residual Memorization in Post-Unlearning LLMs

📅 2025-11-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of detecting residual memorization after data unlearning in large language models (LLMs), this paper proposes REMIND, the first method to assess unlearning efficacy via dynamic loss landscape analysis within the semantic neighborhood of forgotten data. Methodologically, REMIND generates semantically proximal inputs through small perturbations and characterizes the flatness of the loss surface near the forgotten samples: successful unlearning yields a flatter local loss landscape. It quantifies the extent of unlearning by jointly measuring gradient smoothness and pattern consistency. Crucially, REMIND operates via black-box model queries only, requires no access to model parameters, and is robust to input paraphrasing. Empirically, it significantly outperforms existing evaluation baselines across multiple LLMs and datasets. The approach is deployable in practice and establishes a novel, interpretable, and highly sensitive paradigm for trustworthy verification of machine unlearning.
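
The core idea described above is query-only: perturb an input, read back the model's losses over the resulting neighborhood, and summarize how flat that local loss landscape is. Below is a minimal sketch of that idea; `query_loss` is a hypothetical black-box loss oracle (not part of the paper's code), token dropping stands in for whatever perturbation scheme REMIND actually uses, and the flatness statistics are illustrative proxies rather than the paper's exact metrics.

```python
import random
import statistics

def perturb(text: str, num_variants: int = 16, drop_prob: float = 0.1) -> list[str]:
    """Generate semantically proximal variants of an input.
    Here we randomly drop tokens; a real system might paraphrase instead."""
    tokens = text.split()
    variants = []
    for _ in range(num_variants):
        kept = [t for t in tokens if random.random() > drop_prob]
        variants.append(" ".join(kept) if kept else text)
    return variants

def neighborhood_flatness(text: str, query_loss) -> dict:
    """Probe the loss over small input variations and summarize local flatness.
    A flatter (low-delta, low-variance) neighborhood suggests the sample's
    influence has been removed; sharper, more volatile neighborhoods suggest
    residual memorization."""
    base_loss = query_loss(text)                      # black-box query on the original input
    neighbor_losses = [query_loss(v) for v in perturb(text)]
    deltas = [abs(loss - base_loss) for loss in neighbor_losses]
    return {
        "base_loss": base_loss,
        "mean_abs_delta": statistics.mean(deltas),    # proxy for local steepness
        "delta_std": statistics.stdev(deltas),        # proxy for local volatility
    }
```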

📝 Abstract
Machine unlearning aims to remove the influence of specific training data from a model without requiring full retraining. This capability is crucial for ensuring privacy, safety, and regulatory compliance. Therefore, verifying whether a model has truly forgotten target data is essential for maintaining reliability and trustworthiness. However, existing evaluation methods often assess forgetting at the level of individual inputs. This approach may overlook residual influence present in semantically similar examples. Such influence can compromise privacy and lead to indirect information leakage. We propose REMIND (Residual Memorization In Neighborhood Dynamics), a novel evaluation method aiming to detect the subtle remaining influence of unlearned data and classify whether the data has been effectively forgotten. REMIND analyzes the model's loss over small input variations and reveals patterns unnoticed by single-point evaluations. We show that unlearned data yield flatter, less steep loss landscapes, while retained or unrelated data exhibit sharper, more volatile patterns. REMIND requires only query-based access, outperforms existing methods under similar constraints, and demonstrates robustness across different models, datasets, and paraphrased inputs, making it practical for real-world deployment. By offering a more sensitive and interpretable measure of unlearning effectiveness, REMIND provides a reliable framework for assessing unlearning in language models and a novel perspective on memorization and unlearning.
Problem

Research questions and friction points this paper is trying to address.

Detect residual memorization in post-unlearning language models
Evaluate forgetting effectiveness beyond individual input level
Prevent indirect information leakage through semantically similar examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes loss over small input variations
Detects residual influence in semantically similar examples
Classifies data as effectively forgotten or not (see the sketch after this list)
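
The classification step can be illustrated on top of the flatness probe sketched earlier. The decision rule and threshold below are assumptions for illustration only; the paper's actual classifier and its calibration may differ.

```python
def is_effectively_forgotten(text: str, query_loss, steepness_threshold: float = 0.05) -> bool:
    """Flag a sample as effectively forgotten when its local loss landscape is flat.
    `steepness_threshold` is a hypothetical value that would need to be calibrated,
    e.g., against retained and unrelated examples."""
    stats = neighborhood_flatness(text, query_loss)   # from the earlier sketch
    return stats["mean_abs_delta"] < steepness_threshold
```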
👥 Authors
Liran Cohen
Technion - Israel Institute of Technology
Yaniv Nemcovesky
Technion - Israel Institute of Technology
Avi Mendelson
Electrical Engineering and Computer Science, Technion (Computer systems)