InverseScope: Scalable Activation Inversion for Interpreting Large Language Models

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing interpretability methods for large language models (LLMs) rely on restrictive assumptions, particularly sparse linear mappings, which limits their applicability to high-dimensional, nonlinear internal representations. Method: the paper proposes an activation-inversion framework for representational interpretation that abandons sparse linear assumptions and instead employs a conditional generative architecture centered on Conditional Variational Autoencoders (CVAEs), modeling the distribution of inputs that elicit a target neural activation via gradient-guided sampling. It also introduces a quantitative evaluation protocol based on the Feature Consistency Rate (FCR) to enable reproducible, verifiable interpretability assessment. Results: evaluated on billion-parameter-scale models including LLaMA-3 and Qwen, the method substantially improves sample efficiency in high-dimensional activation inversion, achieving an average 37% gain in FCR. The result is an assumption-light, systematic, and quantitatively rigorous framework for interpreting LLM representations.
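The FCR protocol mentioned above can be sketched as a simple fraction: of the inputs sampled to match a target activation, how many exhibit the feature the interpretability hypothesis predicts? This is a minimal illustrative sketch; the function names and the toy "city" hypothesis are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a Feature Consistency Rate (FCR) computation.
# `feature_predicate` encodes an interpretability hypothesis about what the
# target activation represents; names here are illustrative assumptions.
from typing import Callable, Iterable


def feature_consistency_rate(samples: Iterable[str],
                             feature_predicate: Callable[[str], bool]) -> float:
    """Fraction of inverted inputs that exhibit the hypothesized feature."""
    samples = list(samples)
    if not samples:
        return 0.0
    return sum(feature_predicate(s) for s in samples) / len(samples)


# Toy hypothesis: the target activation encodes "mentions a city".
samples = ["Paris is lovely", "The cat sleeps", "I flew to Tokyo", "Berlin at night"]
fcr = feature_consistency_rate(
    samples, lambda s: any(city in s for city in ("Paris", "Tokyo", "Berlin")))
# fcr == 0.75: three of the four sampled inputs match the hypothesis
```

A high FCR over many sampled inputs supports the hypothesis; a low FCR suggests the activation encodes something else.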

📝 Abstract
Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded features. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using feature consistency rate computed over the sampled inputs. InverseScope scales inversion-based interpretability methods to larger models and practical tasks, enabling systematic and quantitative analysis of internal representations in real-world LLMs.
Problem

Research questions and friction points this paper is trying to address.

Interpret internal representations of large language models
Overcome inefficiency in high-dimensional activation sampling
Enable scalable quantitative analysis of LLM features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Assumption-light framework for activation interpretation
Conditional generation improves sample efficiency
Quantitative protocol tests interpretability hypotheses