Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neuron polysemanticity in large language models (LLMs) impedes precise concept intervention: conventional discrete "one neuron, one concept" attribution fails to capture nuanced semantic associations. Method: We propose NeuronLens, a continuous neuron attribution framework grounded in activation magnitude ranges. Departing from binary attribution, we empirically observe that a neuron's activations on different semantic concepts follow distinct Gaussian-like distributions; leveraging this, NeuronLens performs fine-grained, discriminative multi-concept attribution via statistical modeling (Gaussian fitting), activation distribution characterization, and range-aware intervention, applicable to both encoder- and decoder-based LLMs. Contribution/Results: Experiments demonstrate that NeuronLens significantly reduces unintended side effects during concept manipulation and consistently outperforms state-of-the-art neuron editing methods on multiple text classification benchmarks, establishing a continuous, distribution-aware paradigm for improving both the interpretability and controllability of LLMs.
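The core idea above can be sketched in a few lines: fit a Gaussian to one neuron's activation magnitudes per concept, then attribute a new activation to the concept whose Gaussian assigns it the highest likelihood. This is a minimal illustrative sketch, not the paper's implementation; the function names and the maximum-likelihood attribution rule are assumptions.

```python
import numpy as np

def fit_concept_gaussians(activations_by_concept):
    """Fit a Gaussian (mean, std) to one neuron's activations per concept.

    `activations_by_concept` maps a concept name to a 1-D array of the
    neuron's activation magnitudes on examples of that concept.
    (Hypothetical helper, not the paper's actual API.)
    """
    return {
        concept: (float(np.mean(acts)), float(np.std(acts)) + 1e-8)
        for concept, acts in activations_by_concept.items()
    }

def attribute(activation, gaussians):
    """Attribute an activation to the concept with the highest
    Gaussian log-likelihood (constant terms dropped)."""
    def log_pdf(x, mu, sigma):
        return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
    return max(gaussians, key=lambda c: log_pdf(activation, *gaussians[c]))
```

Because attribution compares likelihoods across concepts rather than thresholding a single binary mapping, one neuron can be credited with different concepts in different activation ranges.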

📝 Abstract
Interpreting and controlling the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Recent efforts have primarily focused on identifying and manipulating neurons by establishing discrete mappings between neurons and semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. This makes precise control challenging and complicates downstream interventions. Through an in-depth analysis of both encoder- and decoder-based LLMs across multiple text classification datasets, we uncover that while individual neurons encode multiple concepts, their activation magnitudes vary across concepts in distinct, Gaussian-like patterns. Building on this insight, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference, while maintaining precise control for manipulation of targeted concepts, outperforming existing methods.
Problem

Research questions and friction points this paper is trying to address.

Improving LLM trustworthiness and utility
Handling polysemanticity in neuron encoding
Enhancing neuron activation control precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Range-based neuron interpretation
Gaussian-like activation patterns
NeuronLens manipulation framework
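The range-based manipulation idea in the bullets above can be illustrated as follows: instead of zeroing a polysemantic neuron outright (which disturbs every concept it encodes), move only those activations that fall inside the target concept's Gaussian range, and leave activations outside that range untouched. This is an illustrative sketch under stated assumptions; the range width `k` and the boundary-snapping rule are not from the paper.

```python
def range_aware_suppress(act, mu, sigma, k=2.0):
    """Suppress a target concept on one neuron.

    The concept's activation range is taken as mu ± k*sigma (assumption).
    Activations inside the range are snapped to its lower boundary,
    exiting the concept's range; activations outside it, which plausibly
    encode other concepts, pass through unchanged.
    """
    lo, hi = mu - k * sigma, mu + k * sigma
    if lo <= act <= hi:
        return lo  # leave the target concept's activation range
    return act    # other concepts on this neuron are untouched
```

This contrasts with discrete neuron editing, where the whole neuron is ablated and every concept it participates in is affected, which is the "unintended interference" the abstract reports NeuronLens reducing.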