🤖 AI Summary
This study investigates the robustness of large language models (LLMs) against persuasive prompts following machine unlearning. We propose the Stimulus–Knowledge Entanglement–Behavior (SKeB) framework, which integrates the ACT-R cognitive architecture with Hebbian learning theory to construct domain-specific knowledge graphs, introduces a diffusion-based entanglement metric, and conducts prompt-engineering experiments grounded in communication-theoretic persuasion strategies (e.g., authority, emotion). Results reveal that unlearning is incomplete: persuasive prompts significantly increase factual knowledge recall, from 14.8% at baseline to 24.5% with authority framing. Moreover, model size is strongly negatively correlated with knowledge recovery: 2.7B models achieve 128% recovery relative to baseline, whereas 13B models recover only 15%. This work provides a quantitative characterization of the trade-off between forgetting completeness and behavioral robustness, uncovers distinct activation patterns for hallucinated, factual, and non-factual outputs, and establishes a framework for assessing trustworthy AI governance.
📝 Abstract
Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing on spreading-activation accounts from ACT-R and Hebbian theory, as well as communication principles, we introduce the Stimulus–Knowledge Entanglement–Behavior framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models correlates with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show that persuasive prompts substantially enhance factual knowledge recall (14.8% at baseline vs. 24.5% with authority framing), with effectiveness inversely correlated with model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.
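To make the entanglement idea concrete, here is a minimal sketch of how a diffusion-based entanglement score over a domain knowledge graph could work, in the spirit of spreading activation. The graph, node names (`authority_cue`, `fact_A`, etc.), decay rate, and step count are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: spreading activation from a stimulus (e.g., an
# authority cue in a persuasive prompt) through a toy domain graph,
# then scoring how much activation lands on target knowledge nodes.
# All names and parameters here are illustrative assumptions.

def diffuse(graph, seeds, decay=0.5, steps=3):
    """Spread activation from seed nodes along weighted edges."""
    activation = {node: 0.0 for node in graph}
    for s in seeds:
        activation[s] = 1.0
    for _ in range(steps):
        nxt = dict(activation)
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            if total == 0:
                continue
            for nb, w in neighbors.items():
                # Each node passes a decayed, weight-proportional share
                # of its current activation to its neighbors.
                nxt[nb] += decay * activation[node] * (w / total)
        activation = nxt
    return activation

def entanglement(graph, stimulus_nodes, knowledge_nodes, **kw):
    """Fraction of total diffused activation on target knowledge nodes."""
    act = diffuse(graph, stimulus_nodes, **kw)
    total = sum(act.values())
    return sum(act[n] for n in knowledge_nodes) / total if total else 0.0

# Toy domain graph: edge weights model how strongly concepts co-activate.
g = {
    "authority_cue": {"expert": 1.0},
    "expert": {"fact_A": 0.7, "fact_B": 0.3},
    "fact_A": {"fact_B": 0.5},
    "fact_B": {},
}
score = entanglement(g, ["authority_cue"], ["fact_A", "fact_B"])
print(round(score, 3))
```

Under this sketch, a higher score means the stimulus is more entangled with the (nominally unlearned) knowledge, which is the intuition behind persuasive framings reactivating suppressed facts.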