🤖 AI Summary
To address the computational redundancy of large language models (LLMs) blindly processing entire inputs, this paper proposes a dynamic context cutoff mechanism that enables models to autonomously detect when they have acquired sufficient information and terminate redundant inference. Methodologically, it combines attention-head probing, a lightweight state classifier, and prompt engineering to achieve real-time, representation-driven truncation. The key contribution is the first identification of detectable "sufficiency signals" in attention heads: latent indicators of the model's internal comprehension state that naturally guide processing decisions and, in larger models, support prompt-controllable self-assessment. Evaluated on six long-context QA benchmarks (up to 40K tokens), the method reduces input tokens by 32.5% on average (reported as a 1.33x average token reduction) while improving accuracy by 1.3%, significantly outperforming existing context compression approaches.
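The paper's probe details are not given here, but the core idea of training a lightweight classifier on attention-head features lends itself to a short illustration. The sketch below is hypothetical throughout: the feature matrix `X`, the sufficiency labels `y`, and the `is_sufficient` helper are stand-ins for whatever features and labels the authors actually use, not their implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in training data: 512 prefixes, each summarized by 32 features read
# from selected attention heads (e.g., attention mass from the query tokens,
# pooled per head). Real features would come from the LLM's forward pass.
X = rng.normal(size=(512, 32))               # hypothetical head features
y = (X[:, :4].sum(axis=1) > 0).astype(int)   # hypothetical sufficiency labels

# The "lightweight state classifier": a linear probe is cheap enough to be
# queried repeatedly during prefill.
probe = LogisticRegression(max_iter=1000).fit(X, y)

def is_sufficient(head_features: np.ndarray, threshold: float = 0.9) -> bool:
    """Return True once the probe is confident the prefix already suffices."""
    p_sufficient = probe.predict_proba(head_features.reshape(1, -1))[0, 1]
    return p_sufficient >= threshold
```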
📝 Abstract
Large language models (LLMs) process entire input contexts indiscriminately, which is inefficient when the information required to answer a query is localized within the context. We present dynamic context cutoff, a human-inspired method enabling LLMs to self-terminate processing upon acquiring sufficient task-relevant information. Through analysis of model internals, we discover that specific attention heads inherently encode "sufficiency signals", detectable through lightweight classifiers, that predict when critical information has been processed. This reveals a new efficiency paradigm: models' internal understanding, rather than external compression heuristics, naturally dictates processing needs. Comprehensive experiments across six QA datasets (up to 40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate 1.33x average token reduction while improving accuracy by 1.3%. Furthermore, our method outperforms other context efficiency methods at the same rate of token reduction. Additionally, we observe an emergent scaling phenomenon: while smaller models require probing for sufficiency detection, larger models exhibit intrinsic self-assessment capabilities through prompting.
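To make the cutoff mechanism concrete, here is a minimal sketch of chunked prefill with early termination, assuming a probe like the one above. `encode_chunk` and `head_features` are hypothetical placeholders for an incremental forward pass and for reading the probed attention-head signals; they are not the paper's API.

```python
from typing import Callable, Sequence

import numpy as np

def dynamic_context_cutoff(
    chunks: Sequence[str],
    encode_chunk: Callable[[str], None],      # hypothetical: extend the KV cache by one chunk
    head_features: Callable[[], np.ndarray],  # hypothetical: read current head signals
    sufficient: Callable[[np.ndarray], bool], # e.g., is_sufficient from the sketch above
) -> int:
    """Prefill the context chunk by chunk and stop once the probe fires.

    Returns the number of chunks actually processed; comparing it to
    len(chunks) gives the achieved token reduction (the paper reports
    ~1.33x on average).
    """
    processed = 0
    for chunk in chunks:
        encode_chunk(chunk)                   # one incremental prefill step
        processed += 1
        if sufficient(head_features()):
            break                             # sufficiency signal detected: cut off here
    return processed
```

Per the abstract, larger models can reportedly make the same stopping decision through prompting alone, in which case the trained probe would be replaced by a self-assessment query to the model itself.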