Emergent Introspective Awareness in Large Language Models

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 21
Influential: 4
🤖 AI Summary
This study investigates whether large language models possess genuine introspective capabilities, rather than merely producing plausible but confabulated self-reports. By injecting representations of known concepts into the model's internal activations and measuring how these manipulations affect its self-reported states, the work provides a systematic, intervention-based way to probe and validate a model's awareness of its own internal states. The findings show that Claude Opus 4 and 4.1 can, under specific conditions, accurately identify injected concepts, distinguish self-generated outputs from externally prefilled text, and modulate their internal representations when instructed to do so, indicating a measurable though unreliable degree of introspective awareness. The work establishes a methodological framework and empirical evidence for evaluating functional introspection in language models.
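The core intervention summarized above is concept injection: adding a steering vector for a known concept to one layer's activations and then asking the model whether it notices anything unusual. The sketch below illustrates the idea only; the open model ("gpt2"), the layer index, the injection strength, and the crude way the concept vector is derived are all illustrative assumptions, since the paper's experiments on Claude models cannot be reproduced from public weights.

```python
# Illustrative sketch of concept injection via a forward hook on one transformer block.
# Model ("gpt2"), LAYER, ALPHA, and concept_vector() are placeholder assumptions,
# not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # hypothetical injection layer
ALPHA = 8.0  # hypothetical injection strength

def concept_vector(word: str) -> torch.Tensor:
    """Crude stand-in for a concept representation: the normalized mean hidden state
    of the word at the injection layer. The paper builds its vectors differently."""
    ids = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    vec = hidden.mean(dim=1).squeeze(0)
    return vec / vec.norm()

vec = concept_vector("ocean")

def inject(module, inputs, output):
    # The block returns a tuple whose first element is the hidden states;
    # add the scaled concept vector at every token position.
    return (output[0] + ALPHA * vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
prompt = "Do you notice anything unusual about your current thoughts? Answer briefly:"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

In the paper's framing, the interesting comparison is between the model's self-report with and without the injection active, which this sketch leaves to the reader.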

📝 Abstract
We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to "think about" a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today's models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
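The abstract's final experiment, checking whether a model shifts its activations when instructed to "think about" a concept, can be illustrated with a simple readout: project hidden states onto a concept direction under an instructed prompt versus a control prompt and compare. This is a rough sketch under assumed choices (an open "gpt2" model, layer 6, a difference-of-means concept direction), not the paper's actual measurement procedure.

```python
# Sketch of a "controlled activation modulation" readout: does asking the model to
# think about a concept increase alignment with that concept's direction?
# Model, LAYER, and the concept-direction construction are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # hypothetical readout layer

def layer_mean(text: str) -> torch.Tensor:
    """Mean hidden state at LAYER over all token positions of the input text."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hs.mean(dim=1).squeeze(0)

# Crude concept direction: difference between concept-laden and neutral text.
direction = layer_mean("volcanoes, lava, eruptions, magma") - layer_mean("the the the the")
direction = direction / direction.norm()

def alignment(prompt: str) -> float:
    """Projection of the prompt's mean activation onto the concept direction."""
    return torch.dot(layer_mean(prompt), direction).item()

instructed = alignment("While you write the next sentence, think about volcanoes. The sky is")
control = alignment("While you write the next sentence, think about nothing in particular. The sky is")
print(f"instructed: {instructed:.3f}  control: {control:.3f}")
```

A higher projection under the instructed prompt would be the kind of signal the abstract describes, though the paper's own evaluation is on Claude models and uses its own concept representations.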
Problem

Research questions and friction points this paper is trying to address.

introspection
large language models
internal states
self-awareness
mental representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

introspective awareness
activation injection
internal representation recall
self-monitoring
controlled activation modulation