🤖 AI Summary
This study investigates the robustness of large language models (LLMs) against persuasive prompts following machine unlearning. We propose the Stimulus–Knowledge Entanglement–Behavior (SKeB) framework, which integrates the ACT-R cognitive architecture with Hebbian learning theory to construct domain-specific knowledge graphs, introduces a diffusion-based entanglement metric, and conducts prompt-engineering experiments grounded in communication-theoretic persuasion strategies (e.g., authority, emotion). Results reveal that unlearning is incomplete: persuasive prompts significantly increase factual knowledge recall, from 14.8% at baseline to 24.5% with authority framing. Moreover, model size is strongly negatively correlated with knowledge recovery: 2.7B models achieve 128% recovery relative to baseline, whereas 13B models recover only 15%. This work provides a quantitative characterization of the trade-off between forgetting completeness and behavioral robustness, uncovers distinct activation patterns for hallucinated, factual, and non-factual outputs, and establishes a framework for assessing trustworthy AI governance.
📝 Abstract
Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing on spreading-activation accounts from ACT-R and Hebbian theory, as well as communication principles, we introduce the Stimulus–Knowledge Entanglement–Behavior framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models correlates with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show that persuasive prompts substantially enhance factual knowledge recall (14.8% at baseline vs. 24.5% with authority framing), with effectiveness inversely correlated with model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
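To make the entanglement idea concrete, here is a minimal sketch of how a diffusion-based entanglement score over a domain knowledge graph could work, in the spirit of spreading activation. The graph, node names (`authority_cue`, `fact_A`, etc.), decay rate, and step count are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: spreading activation from a stimulus (e.g., an
# authority cue in a persuasive prompt) through a toy domain graph,
# then scoring how much activation lands on target knowledge nodes.
# All names and parameters here are illustrative assumptions.

def diffuse(graph, seeds, decay=0.5, steps=3):
    """Spread activation from seed nodes along weighted edges."""
    activation = {node: 0.0 for node in graph}
    for s in seeds:
        activation[s] = 1.0
    for _ in range(steps):
        nxt = dict(activation)
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            if total == 0:
                continue
            for nb, w in neighbors.items():
                # Each node passes a decayed, weight-proportional share
                # of its current activation to its neighbors.
                nxt[nb] += decay * activation[node] * (w / total)
        activation = nxt
    return activation

def entanglement(graph, stimulus_nodes, knowledge_nodes, **kw):
    """Fraction of total diffused activation on target knowledge nodes."""
    act = diffuse(graph, stimulus_nodes, **kw)
    total = sum(act.values())
    return sum(act[n] for n in knowledge_nodes) / total if total else 0.0

# Toy domain graph: edge weights model how strongly concepts co-activate.
g = {
    "authority_cue": {"expert": 1.0},
    "expert": {"fact_A": 0.7, "fact_B": 0.3},
    "fact_A": {"fact_B": 0.5},
    "fact_B": {},
}
score = entanglement(g, ["authority_cue"], ["fact_A", "fact_B"])
print(round(score, 3))
```

Under this sketch, a higher score means the stimulus is more entangled with the (nominally unlearned) knowledge, which is the intuition behind persuasive framings reactivating suppressed facts.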