Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

📅 2026-06-02
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the inefficiency in large language model (LLM) inference caused by the computation and communication overhead of low-information tokens. The authors propose Entropy Gate, a framework that applies thermodynamic entropy quenching to LLM token compression. It combines statistical, structural, and positional features into a multi-factor information energy metric, then uses adaptive temperature scheduling and Boltzmann-based survival probabilities to dynamically prune low-energy tokens, augmented with semantic fidelity gating and context deduplication. On the theory side, the authors show that selecting tokens in descending order of information energy maximizes expected semantic retention and that achievable compression approaches the information-theoretic limit. Experiments report 40–60% compression across five prompt types while maintaining semantic similarity ($S_E > 0.80$); with energy-squared amplification and external memory, agent tasks reach a total compression of 88–96%. The framework supports stateless, model-agnostic deployment.
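One way to read the 88–96% total is as stage-wise reductions that compound, each stage acting only on the tokens the previous stage left behind. The following is a minimal arithmetic sketch under that assumption; the function name `combined_reduction` and the example stage values are hypothetical and are not figures taken from the paper.

```python
def combined_reduction(*stage_reductions):
    """Total fraction of tokens removed when stages are applied in sequence.

    Each argument is the fraction (0..1) that one stage removes from
    whatever tokens reach it, so the surviving fractions multiply.
    """
    remaining = 1.0
    for reduction in stage_reductions:
        if not 0.0 <= reduction <= 1.0:
            raise ValueError("each reduction must lie in [0, 1]")
        remaining *= 1.0 - reduction
    return 1.0 - remaining


# Hypothetical example: a 50% quenching stage followed by an external-memory
# stage that removes 80% of what is left.
total = combined_reduction(0.5, 0.8)
print(f"combined reduction: {total:.0%}")
```

Under these illustrative inputs the surviving fraction is 0.5 × 0.2 = 0.1 of the original tokens, which shows how two moderate stages can land in the reported range.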
📝 Abstract
LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha\tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below a threshold, with a fidelity gate halting compression when energy-weighted similarity drops below $\theta$. We prove that token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40–60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10–25 percentage points. Context deduplication adds 50–70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88–96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.
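The quenching loop described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The schedule $T(\tau) = T_0 / (1 + \alpha\tau)$ is taken from the abstract. Two parts are assumptions: the Boltzmann exponent uses each token's gap to the maximum energy, so that low-energy tokens are the ones frozen out, and the fidelity score is a simple stand-in (fraction of total energy retained) for the energy-weighted similarity. All function names, defaults, and the example energies are hypothetical.

```python
import math


def temperature(step, t0=1.0, alpha=0.5):
    """Quenching schedule T(tau) = T0 / (1 + alpha * tau)."""
    return t0 / (1.0 + alpha * step)


def survival_probability(gap, temp, k=1.0):
    """Boltzmann factor exp(-gap / (k * T)).

    Assumption: `gap` is the token's deficit to the maximum energy (>= 0),
    so tokens far below the top energy get low survival probability.
    """
    return math.exp(-gap / (k * temp))


def quench(energies, steps=10, p_min=0.3, theta=0.8, t0=1.0, alpha=0.5):
    """Return indices of tokens that survive fidelity-gated quenching."""
    e_max = max(energies)
    total = sum(energies)
    survivors = list(range(len(energies)))
    for step in range(steps):
        temp = temperature(step, t0, alpha)
        candidate = [
            i for i in survivors
            if survival_probability(e_max - energies[i], temp) >= p_min
        ]
        # Stand-in fidelity score (assumption): share of total energy retained.
        fidelity = sum(energies[i] for i in candidate) / total
        if fidelity < theta:
            break  # fidelity gate: keep the last acceptable survivor set
        survivors = candidate
    return survivors


# Hypothetical per-token information energies.
energies = [0.9, 0.1, 0.8, 0.2, 1.0, 0.05]
print(quench(energies))
```

Because the temperature only decreases and candidates are drawn from the current survivors, each step's survivor set is a subset of the previous one, which mirrors the nested-survival-set property the abstract mentions.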
Problem

Research questions and friction points this paper is trying to address.

token compression
redundancy
large language models
information entropy
LLM pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy Quenching
Token Compression
Information Energy
Semantic Fidelity
Model-Agnostic
Justice Owusu Agyemang
Cybersecurity Researcher
Internet of Things (IoT), Networks and Applications Security, Artificial Intelligence, Applied Cryptography
Jerry John Kponyo
Quantum and Assistive Technologies Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
Kwame Opuni-Boachie Obour Agyekum
VIA Cybersecurity Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
Francisca Adoma Acheampong
VIA Cybersecurity Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
Kwame Agyeman-Prempeh Agyekum
VIA Cybersecurity Lab, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
James Dzisi Gadze
Associate Professor, KNUST.
Mobile and wireless communication