Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning

📅 2025-07-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the degradation in accuracy and efficiency of long-range reasoning in large language models (LLMs) caused by limited context length, this paper proposes the Thread Inference Model (TIM) and its runtime system TIMRUN. Methodologically, TIM is the first to formalize natural language inference as a structured reasoning tree that jointly captures depth and breadth; it introduces rule-driven subtask pruning and selective KV cache retention to overcome positional-encoding limitations and GPU memory constraints. It further supports recursive task decomposition, multi-hop tool invocation, and GPU memory page reuse. Experimental results demonstrate that TIM maintains high accuracy on mathematical reasoning and multi-hop retrieval tasks, even with 90% KV cache compression, while sustaining stable throughput. It significantly outperforms conventional window-based models, enabling sustainable long-range reasoning with near-unbounded working memory.

📝 Abstract
To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions, based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
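The reasoning-tree structure the abstract describes (tasks with thoughts, recursive subtasks, and conclusions) and the rule-based subtask pruning can be sketched as a small recursive data type. This is an illustrative sketch only, not the paper's implementation; the names `Task` and `working_memory` are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """One node of a reasoning tree: a thought, optional recursive
    subtasks, and a conclusion summarizing the finished subtree."""
    thought: str
    subtasks: List["Task"] = field(default_factory=list)
    conclusion: str = ""

def working_memory(task: Task) -> List[str]:
    """Rule-based subtask pruning: once a subtask is concluded, only its
    conclusion remains in working memory; its internal thoughts and
    deeper subtasks are pruned, so their KV states can be evicted."""
    kept = [task.thought]
    for sub in task.subtasks:
        if sub.conclusion:          # finished subtask: keep only the summary
            kept.append(sub.conclusion)
        else:                       # active subtask: recurse into it
            kept.extend(working_memory(sub))
    if task.conclusion:
        kept.append(task.conclusion)
    return kept
```

Under this rule, working memory grows with the depth of the currently active branch rather than with the total number of generated tokens, which is how the tree view can sidestep a linear context limit.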
Problem

Research questions and friction points this paper is trying to address.

Overcoming context limits in LLMs for accurate reasoning
Enabling long-horizon structured reasoning beyond memory constraints
Improving efficiency in multi-hop tool calls and retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thread Inference Model (TIM) enables recursive, decompositional problem solving
TIMRUN runtime supports virtually unlimited working memory
Rule-based subtask pruning optimizes GPU memory usage
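The memory-reuse idea behind the last bullet can be sketched with a toy paged KV cache: KV states live in fixed-size pages, and pruning a finished subtask frees whole pages that later tokens reuse instead of allocating new GPU memory. This is a hypothetical sketch, not TIMRUN's actual API:

```python
class PagedKVCache:
    """Toy paged KV cache: pruning returns pages to a free list so new
    tokens reuse them, keeping physical allocation bounded."""

    def __init__(self, page_size: int = 2):
        self.page_size = page_size
        self.pages = {}       # page id -> list of KV entries
        self.order = []       # page ids in logical token order
        self.free_ids = []    # freed page ids, ready for reuse
        self.next_id = 0      # counts physically allocated pages

    def append(self, kv_entry) -> None:
        """Write one token's KV state, opening a page when needed."""
        if not self.order or len(self.pages[self.order[-1]]) == self.page_size:
            if self.free_ids:          # reuse a pruned page
                pid = self.free_ids.pop()
            else:                      # allocate fresh memory
                pid = self.next_id
                self.next_id += 1
            self.pages[pid] = []
            self.order.append(pid)
        self.pages[self.order[-1]].append(kv_entry)

    def prune(self, pid: int) -> None:
        """Evict a page holding a pruned subtask's KV states."""
        self.order.remove(pid)
        del self.pages[pid]
        self.free_ids.append(pid)
```

For example, filling two pages, pruning one, and then generating more tokens reuses the freed page rather than growing the allocation, which mirrors the paper's claim of stable throughput while heavily manipulating the KV cache.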