🤖 AI Summary
This work investigates the “silent reasoning” capability of large language models (LLMs)—i.e., implicit, latent-space inference that operates without explicit intermediate text, transcending context-window boundaries and linguistic conventions—and evaluates its implications for safety-critical scenarios such as covert planning and goal-directed behavior. To this end, we introduce a cross-lingual implicit reasoning benchmark comprising 4,000 questions that requires responses to begin in a non-English language, suppressing explicit intermediate text such as chain-of-thought. We propose the first evaluation paradigm for latent-space reasoning, integrating zero-shot multi-model assessment, difficulty-scaling analysis, and response-trigger control experiments. Empirical validation across 18 state-of-the-art models confirms the ubiquity of this capability: GPT-4.5 achieves 74.7% accuracy, substantially outperforming Grok-2 (67.2%) and Llama 3.1 405B (65.6%). Notably, several models exhibit systematic heuristic avoidance, suggesting non-trivial internal reasoning mechanisms beyond token-level pattern matching.
📝 Abstract
Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally by generating explicit token sequences such as chains of thought. Significant progress in enhancing reasoning abilities has been made by scaling test-time compute. However, understanding and quantifying model-internal reasoning abilities—the inferential "leaps" models make between individual token predictions—remains crucial. This study introduces a benchmark (n = 4,000 items) designed to quantify model-internal reasoning across different domains. We achieve this by having LLMs indicate the correct solution to reasoning problems not through descriptive text, but by producing an initial response token in a specific language other than English, the benchmark language. This not only requires models to reason beyond their context window, but also to override their default tendency to respond in the same language as the prompt, thereby imposing additional cognitive strain. We evaluate a set of 18 LLMs, showing significant performance variations, with GPT-4.5 achieving the highest accuracy (74.7%), outperforming models like Grok-2 (67.2%) and Llama 3.1 405B (65.6%). Control experiments and difficulty-scaling analyses suggest that while LLMs engage in internal reasoning, we cannot rule out heuristic exploitation under certain conditions, marking an area for future investigation. Our experiments demonstrate that LLMs can "think" via latent-space computations, revealing model-internal inference strategies that need further understanding, especially regarding safety-related concerns such as covert planning, goal-seeking, or deception emerging without explicit token traces.