Interpreting Brain Responses to Language with Sparse Features from Language Models

📅 2026-06-04

🤖 AI Summary

This study addresses the challenge of characterizing which linguistic features are represented in human language cortex without simply relating one black box (brain activity) to another (a language model). It proposes Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically organized sparse autoencoder (SAE) features and explicitly includes surprisal as a predictor. Applied to 7T fMRI data from eight participants listening to 200 sentences, the framework recovers previously reported voxel populations tuned to processing difficulty and meaning abstractness, and it interprets a previously uncharacterized voxel population as tuned to people-related content. The results further indicate that the fronto-temporal language network is predicted by a common set of features across its regions, that frontal regions are relatively well explained by surprisal alone, and that brain responses are best explained by the features capturing the most general information in LM representations, suggesting a nontrivial correspondence between brain and LM language representations.
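Surprisal, the predictor included alongside the SAE features, is the negative log probability of a word given its context. The listing does not say how the authors estimate it, so the following is only an illustrative sketch: it uses a toy add-one-smoothed bigram model as a stand-in for a language model, and the function names `train_bigram` and `surprisal_bits` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count bigrams in a list of sentences (toy stand-in for an LM)."""
    vocab = set()
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split()
        vocab.update(words)
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, len(vocab)


def surprisal_bits(sentence, model):
    """Per-word surprisal, -log2 p(word | previous word), with add-one smoothing."""
    pair_counts, context_counts, vocab_size = model
    words = ["<s>"] + sentence.lower().split()
    values = []
    for prev, cur in zip(words, words[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        values.append(-math.log2(p))
    return values


# Assumed toy corpus, not data from the study.
corpus = ["the dog barked", "the dog slept", "the cat slept"]
model = train_bigram(corpus)
dog = surprisal_bits("the dog", model)
cat = surprisal_bits("the cat", model)
```

In this toy corpus "dog" follows "the" more often than "cat" does, so "cat" receives the higher surprisal. A sentence-level predictor could then be formed, for example, by averaging the per-word values.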

📝 Abstract

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.
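The general shape of an encoding model that augments sparse features with surprisal can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, ridge regression with a fixed penalty is an assumed choice of regressor, and the feature count, fold count, and all function names are hypothetical. Only the 200-sentence stimulus size is taken from the abstract.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression on centered data; returns weights and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def cross_validated_r(X, y, alpha=1.0, n_folds=5, seed=0):
    """Pearson r between held-out predictions and the observed responses."""
    order = np.random.default_rng(seed).permutation(len(y))
    preds = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        w, b = fit_ridge(X[train_idx], y[train_idx], alpha)
        preds[test_idx] = X[test_idx] @ w + b
    return np.corrcoef(preds, y)[0, 1]


def make_synthetic(n_sentences=200, n_features=50, seed=0):
    """Synthetic stand-ins for SAE activations, surprisal, and one voxel's response."""
    rng = np.random.default_rng(seed)
    # Sparse, non-negative activations: roughly 10% of entries are nonzero.
    active = rng.random((n_sentences, n_features)) < 0.1
    sae = rng.random((n_sentences, n_features)) * active
    surprisal = rng.gamma(shape=2.0, scale=2.0, size=n_sentences)
    # Assumed ground truth: three features plus surprisal drive the voxel.
    y = (2.0 * sae[:, 0] - 1.5 * sae[:, 1] + 1.0 * sae[:, 2]
         + 0.2 * surprisal + 0.1 * rng.standard_normal(n_sentences))
    return sae, surprisal, y


sae, surprisal, y = make_synthetic()
X_aug = np.column_stack([sae, surprisal])  # sparse features augmented with surprisal
r_aug = cross_validated_r(X_aug, y)
r_surprisal_only = cross_validated_r(surprisal[:, None], y)
print(f"augmented: {r_aug:.2f}, surprisal only: {r_surprisal_only:.2f}")
```

Comparing the augmented model against a surprisal-only baseline mirrors the kind of contrast described for frontal regions. Because each column of the design matrix is a single named feature, the fitted weights can be read off feature by feature, which is the property that makes sparse features easier to interpret than dense hidden states.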

Problem

Research questions and friction points this paper is trying to address.

brain-language alignment

neural interpretation

language representation

sparse features

surprisal

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoder

encoding model

model-brain alignment