I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data?

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether large language models (LLMs) spontaneously acquire human-interpretable, discrete semantic concepts solely through next-token prediction. Method: We propose a generative model grounded in latent discrete conceptual variables and establish a theoretical identifiability result between LLM representations and human-defined concepts, for the first time under non-invertible mappings from latents to tokens. We prove that hidden-layer activations approximately correspond to linear transformations of the log-posterior probabilities over concepts, strongly supporting the linear representation hypothesis. Our approach integrates information-theoretic analysis, Bayesian inference, concept-driven generative modeling, and empirical representation decoding. Contribution/Results: We validate the framework on the Pythia, Llama, and DeepSeek model families as well as synthetic data, demonstrating that activations across layers can be linearly decoded into fine-grained semantic concepts with accuracy significantly surpassing baselines. These findings confirm that LLMs contain extractable, consistent, and interpretable conceptual structure, bridging the gap between statistical learning and conceptual understanding.
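The linear-decoding evaluation described above can be illustrated with a minimal probing sketch: fit a purely linear classifier from one layer's activations to discrete concept labels and report held-out accuracy. This is only an illustrative sketch, not the paper's actual pipeline; the function name, data layout, and use of scikit-learn are assumptions.

```python
# Minimal sketch of a linear concept-decoding probe, in the spirit of the
# evaluation described above. Hidden states and concept labels are assumed
# to be given; names below are illustrative, not taken from the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(hidden_states: np.ndarray, concept_labels: np.ndarray) -> float:
    """Fit a linear map from layer activations to discrete concept labels.

    hidden_states:  (n_samples, hidden_dim) activations from one LLM layer.
    concept_labels: (n_samples,) integer ids of human-defined concepts.
    Returns held-out accuracy of the purely linear decoder.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, concept_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)  # linear decision function only
    probe.fit(X_train, y_train)
    return probe.score(X_test, y_test)
```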

📝 Abstract
The remarkable achievements of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. This stands in contrast to explanations that attribute their capabilities to relatively simple manipulations of vast volumes of data. To illuminate the distinction between these explanations, we introduce a novel generative model that generates tokens on the basis of human interpretable concepts represented as latent discrete variables. Under mild conditions, even when the mapping from the latent space to the observed space is non-invertible, we establish an identifiability result: the representations learned by LLMs through next-token prediction can be approximately modeled as the logarithm of the posterior probabilities of these latent discrete concepts, up to an invertible linear transformation. This theoretical finding not only provides evidence that LLMs capture underlying generative factors, but also strongly reinforces the linear representation hypothesis, which posits that LLMs learn linear representations of human-interpretable concepts. Empirically, we validate our theoretical results through evaluations on both simulation data and the Pythia, Llama, and DeepSeek model families.
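In symbols (notation ours, paraphrasing the abstract rather than quoting the paper's formal statement), the identifiability result says that the representation h(x) learned by next-token prediction is approximately an invertible linear transformation of the vector of log-posterior probabilities over the K latent discrete concepts:

```latex
% Notation ours: x is the observed context, c \in \{1, \dots, K\} the latent
% discrete concept, h(x) the LLM representation, A an invertible linear map.
h(x) \;\approx\; A \,\bigl[\log p(c = 1 \mid x), \,\dots,\, \log p(c = K \mid x)\bigr]^{\top}
```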
Problem

Research questions and friction points this paper is trying to address.

Explores whether next-token prediction alone lets LLMs capture human-interpretable concepts.
Introduces a generative model linking observed tokens to latent discrete concept variables.
Validates that LLMs learn linear representations of interpretable concepts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative model with human-interpretable discrete latent variables (a toy sketch follows this list)
Identifiability of latent concepts learned via next-token prediction
Validation on simulated data and the Pythia, Llama, and DeepSeek model families
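As noted in the first bullet above, a concept-driven generative model of this kind can be sketched in a few lines: sample a latent discrete concept, emit tokens from a concept-specific distribution, and compute the exact log-posterior over concepts via Bayes' rule. This toy version (uniform prior, i.i.d. token emissions) is an assumption for illustration, not the paper's simulation setup.

```python
# Toy illustration (not the paper's exact simulation): tokens are generated
# from a latent discrete concept, so the posterior over concepts is well
# defined and can be compared against learned representations.
import numpy as np

rng = np.random.default_rng(0)

n_concepts, vocab_size, seq_len = 4, 50, 10
concept_prior = np.full(n_concepts, 1.0 / n_concepts)           # p(c), uniform
emission = rng.dirichlet(np.ones(vocab_size), size=n_concepts)  # p(token | c)

def sample_sequence():
    """Draw a concept, then emit tokens i.i.d. from that concept's distribution."""
    c = rng.choice(n_concepts, p=concept_prior)
    tokens = rng.choice(vocab_size, size=seq_len, p=emission[c])
    return c, tokens

def log_posterior(tokens):
    """Exact log p(c | tokens) under the toy model, via Bayes' rule."""
    log_lik = np.log(emission)[:, tokens].sum(axis=1)           # log p(tokens | c)
    log_joint = np.log(concept_prior) + log_lik
    return log_joint - np.logaddexp.reduce(log_joint)           # normalize

c, tokens = sample_sequence()
print("true concept:", c, "argmax posterior:", int(np.argmax(log_posterior(tokens))))
```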