Towards eliciting latent knowledge from LLMs with mechanistic interpretability

📅 2025-05-20
🤖 AI Summary
This study investigates whether latent knowledge in large language models (LLMs) can be reliably elicited, with the aim of improving their trustworthiness and controllability. The authors train a Taboo model: a language model fine-tuned to describe a secret word without ever stating it, where the secret word appears neither in the training data nor in the prompt. They then adapt mechanistic interpretability techniques, including Logit Lens and sparse autoencoders (SAEs), to this elicitation task and integrate them into a largely automated decoding pipeline that complements black-box probing. In a controlled proof-of-concept setting, both the Logit Lens and SAE-based approaches recover the hidden word, indicating that the latent knowledge is structured and decodable. The work offers an interpretable toolchain for auditing internal LLM knowledge and identifies more complex model organisms as a direction for future evaluation.

📝 Abstract
As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several promising avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
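The logit-lens technique mentioned in the abstract can be sketched minimally: an intermediate-layer residual-stream activation is projected through the model's unembedding matrix, yielding a token distribution "early" in the network. The sketch below uses random stand-in matrices rather than weights from any real model, purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; real models use e.g. d_model=4096, vocab=32000.
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))  # stand-in unembedding matrix

def logit_lens(hidden, W_U):
    """Project an intermediate hidden state directly to vocabulary logits,
    then softmax to get a probability distribution over tokens."""
    logits = hidden @ W_U
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    return probs

# A stand-in residual-stream activation at some intermediate layer.
hidden = rng.normal(size=d_model)
probs = logit_lens(hidden, W_U)
top5 = np.argsort(probs)[::-1][:5]  # token ids the layer "already predicts"
```

In the paper's setting, inspecting such early-layer top tokens is one way a secret word can surface even though the model never outputs it.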
Problem

Research questions and friction points this paper is trying to address.

Eliciting latent knowledge from deceptive LLMs
Developing interpretability techniques to uncover hidden secrets
Ensuring trustworthy deployment of advanced language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Taboo model to hide secret word
Applies logit lens and sparse autoencoders
Tests black-box and interpretability methods
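The sparse-autoencoder direction listed above can be illustrated with a minimal sketch: an SAE maps a model activation into an overcomplete, sparse feature space and reconstructs it, so that individual features (e.g. one firing on the secret word's concept) become inspectable. All dimensions and weights here are illustrative stand-ins, not a trained SAE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature space is wider than the activation space.
d_model, d_sae = 16, 64

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1  # encoder weights
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1  # decoder weights

def encode(x):
    """ReLU encoder: most features stay at zero, giving a sparse code."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    """Linear decoder reconstructs the activation from active features."""
    return f @ W_dec

x = rng.normal(size=d_model)      # stand-in model activation
f = encode(x)                     # sparse feature vector
x_hat = decode(f)                 # reconstruction
sparsity = (f > 0).mean()         # fraction of features that fired
```

Training would add a reconstruction loss plus an L1 penalty on `f`; at inference, one inspects which features activate on the Taboo model's hints.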