Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models

📅 2025-02-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The internal mechanisms by which large language models (LLMs) represent linguistic knowledge remain poorly understood. Method: We systematically investigate LLMs’ structured representation capabilities across six core linguistic dimensions—phonetics, phonology, morphology, syntax, semantics, and pragmatics. We propose a sparse autoencoder–driven interpretable feature extraction framework, construct minimal-pair and counterfactual datasets, and perform activation-based causal interventions. Contribution/Results: We introduce two novel metrics—Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC)—enabling fine-grained, intervention-based, and causally verifiable linguistic feature analysis across all six dimensions for the first time. Empirical results demonstrate that LLMs possess genuine, structurally organized linguistic knowledge. Our method enables precise, causal control over complex phenomena such as coreference resolution and metaphor generation, achieving an average 32.7% improvement in intervention accuracy.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) excel in tasks that require complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Previous work on linguistic mechanisms has been limited by coarse granularity, insufficient causal analysis, and a narrow focus. In this study, we present a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs). We extract a wide range of linguistic features from six dimensions: phonetics, phonology, morphology, syntax, semantics, and pragmatics. We extract, evaluate, and intervene on these features by constructing minimal contrast datasets and counterfactual sentence datasets. We introduce two indices-Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC)-to measure the ability of linguistic features to capture and control linguistic phenomena. Our results reveal inherent representations of linguistic knowledge in LLMs and demonstrate the potential for controlling model outputs. This work provides strong evidence that LLMs possess genuine linguistic knowledge and lays the foundation for more interpretable and controllable language modeling in future research.

Problem

Research questions and friction points this paper is trying to address.

Interpreting linguistic features in LLMs

Investigating causal mechanisms in LLMs

Enhancing interpretability and control in language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Auto-Encoders for feature extraction

Minimal contrast and counterfactual datasets

FRC and FIC indices for measurement

🔎 Similar Papers

No similar papers found.

Authors to Follow