Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

📅 2026-05-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of latent (implicit) reasoning in large language models (LLMs): the continuous hidden states that carry the reasoning are hard to interpret, which undermines reliability and controllability. The authors propose a training-free, decode-time intervention framework that turns findings from mechanistic interpretability (structural probing, causal analysis, and geometric analysis of representations) into practical strategies for steering reasoning. By imposing the identified semantic and geometric priors on the latent reasoning process during decoding, the method consistently improves reasoning accuracy across multiple model scales and task domains, without any parameter updates.

📝 Abstract

Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that these interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.
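
To make "training-free, decode-time intervention" concrete, here is a minimal sketch of what imposing a geometric prior on a single continuous thought vector could look like. It is an illustration only and is not taken from the paper: the function name `intervene_on_latent`, the choice of prior (a preferred direction plus a target norm), and the blending weight `alpha` are all assumptions made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def intervene_on_latent(h, direction, target_norm, alpha=0.5):
    """Nudge a latent vector toward a direction, then rescale its norm.

    Hypothetical illustration, not the paper's method:
      h           -- latent (hidden-state) vector as a list of floats
      direction   -- assumed "prior" direction, e.g. found by a probe
      target_norm -- assumed typical norm of well-behaved latents
      alpha       -- 0 keeps h's direction, 1 keeps only the projection
    """
    d_norm = math.sqrt(dot(direction, direction))
    if d_norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [x / d_norm for x in direction]

    # Component of h that lies along the prior direction.
    coeff = dot(h, unit)
    projected = [coeff * x for x in unit]

    # Blend the original vector with its projection.
    blended = [(1.0 - alpha) * a + alpha * p for a, p in zip(h, projected)]

    # Rescale to the target norm; no model parameters are touched.
    b_norm = math.sqrt(dot(blended, blended))
    if b_norm == 0.0:
        return blended
    return [x * target_norm / b_norm for x in blended]
```

In a real decoding loop, a function of this kind would be applied to each latent vector before it is fed back into the model. Which priors are used and how they are derived from the probing analysis is specified in the paper itself, not in this sketch.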

Problem

Research questions and friction points this paper is trying to address.

latent reasoning

interpretability

controllability

black box

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

latent reasoning

interpretability-guided intervention

causal probing