🤖 AI Summary
To address the computational redundancy of large language models (LLMs) blindly processing entire inputs, this paper proposes a dynamic context cutoff mechanism that enables models to autonomously detect when they have acquired sufficient information and terminate redundant inference. Methodologically, it combines attention-head probing, a lightweight state classifier, and prompt engineering to achieve real-time, representation-driven truncation. The key contribution is the first identification of detectable "sufficiency signals" in attention heads: latent indicators of the model's internal comprehension state that naturally guide processing decisions and, in larger models, support prompt-controllable self-assessment. Evaluated on six long-context QA benchmarks (up to 40K tokens), the method reduces input tokens by 32.5% on average (reported as a 1.33x average token reduction) while improving accuracy by 1.3%, significantly outperforming existing context compression approaches.
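The paper's probe details are not given here, but the core idea of training a lightweight classifier on attention-head features lends itself to a short illustration. The sketch below is hypothetical throughout: the feature matrix `X`, the sufficiency labels `y`, and the `is_sufficient` helper are stand-ins for whatever features and labels the authors actually use, not their implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in training data: 512 prefixes, each summarized by 32 features read
# from selected attention heads (e.g., attention mass from the query tokens,
# pooled per head). Real features would come from the LLM's forward pass.
X = rng.normal(size=(512, 32))               # hypothetical head features
y = (X[:, :4].sum(axis=1) > 0).astype(int)   # hypothetical sufficiency labels

# The "lightweight state classifier": a linear probe is cheap enough to be
# queried repeatedly during prefill.
probe = LogisticRegression(max_iter=1000).fit(X, y)

def is_sufficient(head_features: np.ndarray, threshold: float = 0.9) -> bool:
    """Return True once the probe is confident the prefix already suffices."""
    p_sufficient = probe.predict_proba(head_features.reshape(1, -1))[0, 1]
    return p_sufficient >= threshold
```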
📝 Abstract
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals", detectable through lightweight classifiers, that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding, rather than external compression heuristics, naturally dictates processing needs. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method outperforms other context efficiency methods at the same rate of token reduction. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
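To make the cutoff mechanism concrete, here is a minimal sketch of chunked prefill with early termination, assuming a probe like the one above. `encode_chunk` and `head_features` are hypothetical placeholders for an incremental forward pass and for reading the probed attention-head signals; they are not the paper's API.

```python
from typing import Callable, Sequence

import numpy as np

def dynamic_context_cutoff(
    chunks: Sequence[str],
    encode_chunk: Callable[[str], None],      # hypothetical: extend the KV cache by one chunk
    head_features: Callable[[], np.ndarray],  # hypothetical: read current head signals
    sufficient: Callable[[np.ndarray], bool], # e.g., is_sufficient from the sketch above
) -> int:
    """Prefill the context chunk by chunk and stop once the probe fires.

    Returns the number of chunks actually processed; comparing it to
    len(chunks) gives the achieved token reduction (the paper reports
    ~1.33x on average).
    """
    processed = 0
    for chunk in chunks:
        encode_chunk(chunk)                   # one incremental prefill step
        processed += 1
        if sufficient(head_features()):
            break                             # sufficiency signal detected: cut off here
    return processed
```

Per the abstract, larger models can reportedly make the same stopping decision through prompting alone, in which case the trained probe would be replaced by a self-assessment query to the model itself.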