Uncovering Model Processing Strategies with Non-Negative Per-Example Fisher Factorization

📅 2023-10-07

📈 Citations: 1

✨ Influential: 0

career value

183K/year

🤖 AI Summary

This work investigates the implicit strategic neural computation mechanisms underlying large language models (LLMs) in text tasks. To this end, we propose NPEFF—the first non-negative per-sample Fisher decomposition framework—which attributes model decisions to interpretable rank-1 positive semidefinite strategic components, enabling both modeling and intervention at the strategy level. NPEFF integrates non-negative matrix factorization, per-sample Fisher information estimation, and strategy-directed parameter perturbation. We benchmark against baselines including sparse autoencoders and gradient clustering. Empirical evaluation across multilingual LLMs and diverse tasks demonstrates that: (i) the extracted strategies exhibit high human interpretability; (ii) they support selective strategic interference; and (iii) they provide novel analytical tools for studying catastrophic forgetting side effects and the mechanisms of in-context learning.

📝 Abstract

We introduce NPEFF (Non-Negative Per-Example Fisher Factorization), an interpretability method that aims to uncover strategies used by a model to generate its predictions. NPEFF decomposes per-example Fisher matrices using a novel decomposition algorithm that learns a set of components represented by learned rank-1 positive semi-definite matrices. Through a combination of human evaluation and automated analysis, we demonstrate that these NPEFF components correspond to model processing strategies for a variety of language models and text processing tasks. We further show how to construct parameter perturbations from NPEFF components to selectively disrupt a given component's role in the model's processing. Along with conducting extensive ablation studies, we include experiments to show how NPEFF can be used to analyze and mitigate collateral effects of unlearning and use NPEFF to study in-context learning. Furthermore, we demonstrate the advantages of NPEFF over baselines such as gradient clustering and using sparse autoencoders for dictionary learning over model activations.

Problem

Research questions and friction points this paper is trying to address.

Uncover model strategies for generating predictions

Decompose Fisher matrices to identify processing components

Analyze and mitigate effects of unlearning in models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes Fisher matrices with novel algorithm

Learns rank-1 positive semi-definite components

Constructs parameter perturbations to disrupt strategies

🔎 Similar Papers

No similar papers found.