Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

📅 2026-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency in large language model (LLM) inference caused by the computation and communication overhead of low-information tokens. The authors propose Entropy Gate, a framework that applies thermodynamic entropy quenching to LLM token compression. The method combines statistical, structural, and positional features into a multi-factor information energy metric, then uses adaptive temperature scheduling and Boltzmann-based survival probabilities to dynamically prune low-energy tokens, augmented with semantic fidelity gating and context deduplication. Theoretically, selecting tokens in descending order of information energy maximizes semantic retention and approaches the information-theoretic compression limit. Experiments demonstrate 40–60% compression across five prompt types while maintaining semantic similarity $S_E > 0.80$; with energy-squared amplification and external memory, agent tasks reach a total compression of 88–96%. The framework supports stateless, model-agnostic deployment.

📝 Abstract

LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha\tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below threshold, with a fidelity gate halting compression when energy-weighted similarity drops below $\theta$. We prove that token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40–60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10–25 percentage points. Context deduplication adds 50–70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88–96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.
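
As a rough illustration of three ideas in the abstract, the sketch below evaluates the quenching schedule $T(\tau) = T_0 / (1 + \alpha\tau)$, shows that keeping tokens in descending order of energy yields nested survival sets as the budget shrinks, and shows how independent reductions compose multiplicatively. It is not the authors' implementation. The function names, the toy energy values, the default $T_0$ and $\alpha$, and the 50% and 80% stage reductions are all assumptions chosen for illustration. The Boltzmann survival test and the fidelity gate are left out because the abstract does not give enough detail to reproduce them.

```python
from math import prod


def temperature(tau: int, t0: float = 1.0, alpha: float = 0.5) -> float:
    """Quenching schedule T(tau) = T_0 / (1 + alpha * tau).

    t0 and alpha are illustrative defaults, not values from the paper.
    """
    return t0 / (1.0 + alpha * tau)


def select_by_energy(energies: list[float], keep: int) -> set[int]:
    """Return the indices of the `keep` highest-energy tokens."""
    order = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return set(order[:keep])


def combined_reduction(stage_reductions: list[float]) -> float:
    """Total reduction when each stage removes a fraction of what remains.

    The remaining fractions multiply: total = 1 - prod(1 - r_i).
    """
    return 1.0 - prod(1.0 - r for r in stage_reductions)


# Toy per-token energies (hypothetical values, one per token position).
energies = [0.9, 0.1, 0.5, 0.3, 0.7]

# Temperature falls monotonically as the quenching step tau increases.
temps = [temperature(tau) for tau in range(4)]

# Shrinking the budget only ever drops tokens, so the survival sets are nested.
survivors = [select_by_energy(energies, keep) for keep in (4, 3, 2)]

# Hypothetical stages: 50% prompt compression, then 80% from external memory.
total = combined_reduction([0.5, 0.8])

print(temps)
print(survivors)
print(round(total, 2))
```

With these toy numbers, a 50% stage followed by an 80% stage leaves 0.5 × 0.2 = 10% of the original tokens, which is how two moderate reductions can land in the range the abstract reports for agentic workloads.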

Problem

Research questions and friction points this paper is trying to address.

token compression

redundancy

large language models

information entropy

LLM pipelines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy Quenching

Token Compression

Information Energy