Neuron Empirical Gradient: Discovering and Quantifying Neurons' Global Linear Controllability

๐Ÿ“… 2024-12-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses a fundamental interpretability challenge in pretrained language models (PLMs): how feedforward neuron activations globally influence model outputs. We formally establish and empirically validate a global linear relationship between neuron activations and model predictions. To quantify this global controllability, we propose the Neuron Empirical Gradient (NEG)โ€”a novel metric capturing the sensitivity of model outputs to perturbations in individual neuron activationsโ€”and design NeurGrad, an efficient algorithm for its estimation. Building upon NEG, we introduce Skill Neuron Probes to localize linguistic capabilities across diverse prompts. Extensive evaluation on knowledge probing and the MCEval8k multiple-choice benchmark demonstrates that NEG strongly correlates with language skills, enabling precise knowledge editing and fine-grained, neuron-level controllability analysis. All code and data are publicly released.
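The core quantity here, the Neuron Empirical Gradient, measures how the model's output shifts when a single neuron's activation is perturbed. A minimal sketch of that idea, using a toy linear read-out and a central finite difference (the function and variable names are illustrative, not the paper's actual API):

```python
# Hypothetical sketch: estimate a neuron empirical gradient (NEG) by
# perturbing one neuron's activation and measuring the output change.

def toy_output(activations, weights):
    """Toy read-out: output is a weighted sum of neuron activations."""
    return sum(a * w for a, w in zip(activations, weights))

def empirical_gradient(activations, weights, neuron_idx, delta=1e-3):
    """Central-difference estimate of d(output)/d(activation[neuron_idx])."""
    plus, minus = list(activations), list(activations)
    plus[neuron_idx] += delta
    minus[neuron_idx] -= delta
    return (toy_output(plus, weights) - toy_output(minus, weights)) / (2 * delta)

acts = [0.5, -1.2, 2.0]
ws = [0.3, 0.7, -0.1]
neg = empirical_gradient(acts, ws, neuron_idx=1)
print(round(neg, 6))  # for a linear read-out this recovers the weight, 0.7
```

If the activation-output relationship is globally linear, as the paper argues, this estimate is stable across operating points, which is what makes a single scalar gradient per neuron meaningful.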

๐Ÿ“ Abstract
Although feed-forward neurons in pre-trained language models (PLMs) can store knowledge and their importance in influencing model outputs has been studied, existing work focuses on finding a limited set of neurons and analyzing their relative importance. However, the global quantitative role of activation values in shaping outputs remains unclear, hindering further advancements in applications like knowledge editing. Our study first investigates the numerical relationship between neuron activations and model output and discovers the global linear relationship between them through neuron interventions on a knowledge probing dataset. We refer to the gradient of this linear relationship as neuron empirical gradient (NEG), and introduce NeurGrad, an accurate and efficient method for computing NEG. NeurGrad enables quantitative analysis of all neurons in PLMs, advancing our understanding of neurons' controllability. Furthermore, we explore NEG's ability to represent language skills across diverse prompts via skill neuron probing. Experiments on MCEval8k, a multi-choice knowledge benchmark spanning various genres, validate NEG's representational ability. The data and code are released.
Problem

Research questions and friction points this paper is trying to address.

Quantify the global, linear influence of neuron activations on model outputs.
Develop NeurGrad, an efficient method for computing the neuron empirical gradient (NEG).
Validate neurons' ability to represent language skills across diverse prompts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuron Empirical Gradient (NEG)
NeurGrad for computing NEG
Skill neuron probing technique
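Skill neuron probing localizes a linguistic capability by asking how well individual neurons separate prompts that do or do not exercise that skill. A minimal single-neuron threshold probe sketches the idea (this is an illustration under simplifying assumptions, not the paper's implementation, which builds on NEG):

```python
# Hypothetical sketch of skill-neuron probing: rank neurons by how well
# a simple threshold on their activation separates two prompt classes.

def probe_accuracy(acts_pos, acts_neg, threshold):
    """Accuracy of predicting 'skill present' when activation > threshold."""
    correct = sum(a > threshold for a in acts_pos) + sum(a <= threshold for a in acts_neg)
    return correct / (len(acts_pos) + len(acts_neg))

def best_skill_neuron(activations_by_neuron):
    """activations_by_neuron: {neuron_idx: (pos_acts, neg_acts)}.
    Returns (neuron_idx, accuracy) of the best single-neuron probe,
    trying each observed activation as a candidate threshold."""
    best = (None, 0.0)
    for idx, (pos, neg) in activations_by_neuron.items():
        for t in pos + neg:
            acc = probe_accuracy(pos, neg, t)
            if acc > best[1]:
                best = (idx, acc)
    return best

data = {
    0: ([0.9, 0.8, 0.7], [0.2, 0.1, 0.3]),   # cleanly separable "skill" neuron
    1: ([0.5, 0.1, 0.6], [0.4, 0.6, 0.2]),   # noisy, uninformative neuron
}
idx, acc = best_skill_neuron(data)
print(idx, acc)  # neuron 0 separates the classes perfectly: 0 1.0
```

A neuron whose probe accuracy is high across diverse prompt phrasings is a candidate "skill neuron" for that capability.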
๐Ÿ”Ž Similar Papers