Neuron Empirical Gradient: Discovering and Quantifying Neurons' Global Linear Controllability

๐Ÿ“… 2024-12-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses a fundamental interpretability challenge in pretrained language models (PLMs): how feedforward neuron activations globally influence model outputs. We formally establish and empirically validate a global linear relationship between neuron activations and model predictions. To quantify this global controllability, we propose the Neuron Empirical Gradient (NEG)โ€”a novel metric capturing the sensitivity of model outputs to perturbations in individual neuron activationsโ€”and design NeurGrad, an efficient algorithm for its estimation. Building upon NEG, we introduce Skill Neuron Probes to localize linguistic capabilities across diverse prompts. Extensive evaluation on knowledge probing and the MCEval8k multiple-choice benchmark demonstrates that NEG strongly correlates with language skills, enabling precise knowledge editing and fine-grained, neuron-level controllability analysis. All code and data are publicly released.
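The core quantity here, the Neuron Empirical Gradient, measures how the model's output shifts when a single neuron's activation is perturbed. A minimal sketch of that idea, using a toy linear read-out and a central finite difference (the function and variable names are illustrative, not the paper's actual API):

```python
# Hypothetical sketch: estimate a neuron empirical gradient (NEG) by
# perturbing one neuron's activation and measuring the output change.

def toy_output(activations, weights):
    """Toy read-out: output is a weighted sum of neuron activations."""
    return sum(a * w for a, w in zip(activations, weights))

def empirical_gradient(activations, weights, neuron_idx, delta=1e-3):
    """Central-difference estimate of d(output)/d(activation[neuron_idx])."""
    plus, minus = list(activations), list(activations)
    plus[neuron_idx] += delta
    minus[neuron_idx] -= delta
    return (toy_output(plus, weights) - toy_output(minus, weights)) / (2 * delta)

acts = [0.5, -1.2, 2.0]
ws = [0.3, 0.7, -0.1]
neg = empirical_gradient(acts, ws, neuron_idx=1)
print(round(neg, 6))  # for a linear read-out this recovers the weight, 0.7
```

If the activation-output relationship is globally linear, as the paper argues, this estimate is stable across operating points, which is what makes a single scalar gradient per neuron meaningful.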

๐Ÿ“ Abstract
Although feed-forward neurons in pre-trained language models (PLMs) can store knowledge and their importance in influencing model outputs has been studied, existing work focuses on finding a limited set of neurons and analyzing their relative importance. However, the global quantitative role of activation values in shaping outputs remains unclear, hindering further advancements in applications like knowledge editing. Our study first investigates the numerical relationship between neuron activations and model output and discovers the global linear relationship between them through neuron interventions on a knowledge probing dataset. We refer to the gradient of this linear relationship as neuron empirical gradient (NEG), and introduce NeurGrad, an accurate and efficient method for computing NEG. NeurGrad enables quantitative analysis of all neurons in PLMs, advancing our understanding of neurons' controllability. Furthermore, we explore NEG's ability to represent language skills across diverse prompts via skill neuron probing. Experiments on MCEval8k, a multi-choice knowledge benchmark spanning various genres, validate NEG's representational ability. The data and code are released.
Problem

Research questions and friction points this paper is trying to address.

Quantify the global, linear influence of neuron activations on model outputs.
Develop NeurGrad, an efficient method for computing the neuron empirical gradient (NEG).
Validate neurons' ability to represent language skills across diverse prompts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuron Empirical Gradient (NEG)
NeurGrad for computing NEG
Skill neuron probing technique
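Skill neuron probing localizes a linguistic capability by asking how well individual neurons separate prompts that do or do not exercise that skill. A minimal single-neuron threshold probe sketches the idea (this is an illustration under simplifying assumptions, not the paper's implementation, which builds on NEG):

```python
# Hypothetical sketch of skill-neuron probing: rank neurons by how well
# a simple threshold on their activation separates two prompt classes.

def probe_accuracy(acts_pos, acts_neg, threshold):
    """Accuracy of predicting 'skill present' when activation > threshold."""
    correct = sum(a > threshold for a in acts_pos) + sum(a <= threshold for a in acts_neg)
    return correct / (len(acts_pos) + len(acts_neg))

def best_skill_neuron(activations_by_neuron):
    """activations_by_neuron: {neuron_idx: (pos_acts, neg_acts)}.
    Returns (neuron_idx, accuracy) of the best single-neuron probe,
    trying each observed activation as a candidate threshold."""
    best = (None, 0.0)
    for idx, (pos, neg) in activations_by_neuron.items():
        for t in pos + neg:
            acc = probe_accuracy(pos, neg, t)
            if acc > best[1]:
                best = (idx, acc)
    return best

data = {
    0: ([0.9, 0.8, 0.7], [0.2, 0.1, 0.3]),   # cleanly separable "skill" neuron
    1: ([0.5, 0.1, 0.6], [0.4, 0.6, 0.2]),   # noisy, uninformative neuron
}
idx, acc = best_skill_neuron(data)
print(idx, acc)  # neuron 0 separates the classes perfectly: 0 1.0
```

A neuron whose probe accuracy is high across diverse prompt phrasings is a candidate "skill neuron" for that capability.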
๐Ÿ”Ž Similar Papers