🤖 AI Summary
This work addresses the inefficiency in large language model (LLM) inference caused by excessive computation and communication overhead from low-information tokens. The authors propose Entropy Gate, a novel framework that introduces thermodynamic entropy quenching into LLM token compression. By integrating statistical, structural, and positional features into a multi-factor information energy metric, the method employs adaptive temperature scheduling and Boltzmann-based survival probabilities to dynamically prune low-energy tokens, augmented with semantic fidelity gating and context deduplication. Theoretically, selecting tokens in descending order of information energy maximizes semantic retention and approaches the information-theoretic compression limit. Experiments demonstrate 40–60% compression rates across five prompt types while maintaining energy-weighted semantic similarity ($S_E > 0.80$); with energy-squared amplification and external memory, agent tasks achieve total compression of 88–96%, supporting stateless, model-agnostic deployment.
📝 Abstract
LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching, a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy $E(t)$ combining statistical, structural, and positional components. An adaptive quenching schedule $T(\tau) = T_0 / (1 + \alpha\tau)$ removes tokens whose Boltzmann survival probability $p_i = \exp(-E_i / kT)$ falls below threshold, with a fidelity gate halting compression when energy-weighted similarity drops below $\theta$. We prove that token selection by descending $E(t)$ maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit $\text{CR} \to 1 - I(P; T)/H(P)$. A Phase 1 heuristic achieves 40–60% compression across five prompt categories while maintaining $S_E > 0.80$, with energy-squared amplification $E \to E^2$ adding 10–25 percentage points. Context deduplication adds 50–70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88–96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.
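
To make two of the quantities above concrete, the following is a minimal Python sketch of the quenching schedule $T(\tau) = T_0 / (1 + \alpha\tau)$ and of how independent reduction stages could compose multiplicatively. The function names (`quench_temperature`, `combined_reduction`) and the stage rates in the usage note are illustrative assumptions, not values or code taken from the paper.

```python
from math import prod
from typing import Sequence


def quench_temperature(t0: float, alpha: float, tau: float) -> float:
    """Temperature at quench step tau, following T(tau) = T_0 / (1 + alpha * tau)."""
    if 1.0 + alpha * tau <= 0.0:
        raise ValueError("1 + alpha * tau must be positive")
    return t0 / (1.0 + alpha * tau)


def combined_reduction(stage_rates: Sequence[float]) -> float:
    """Total token reduction when each stage removes a fraction of what remains.

    Assumes the stages act independently, so the surviving fractions multiply:
    total = 1 - prod(1 - r_i). This is one reading of "composes multiplicatively".
    """
    for rate in stage_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each stage rate must lie in [0, 1]")
    return 1.0 - prod(1.0 - rate for rate in stage_rates)
```

Under these assumptions, three hypothetical stages that each remove 50%, 60%, and 70% of the remaining tokens leave $0.5 \times 0.4 \times 0.3 = 0.06$ of the original, a total reduction of about 94%, which falls inside the 88–96% range reported for agentic workloads. The schedule is monotonically decreasing in $\tau$ for positive $\alpha$, so later quench steps always run at a lower temperature.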