🤖 AI Summary
The high computational cost of large language model (LLM) inference impedes sustainable deployment, and existing optimization methods lack theoretical grounding in cognitive science. Method: We propose Cognitive Load-Aware Inference (CLAI), the first framework to formalize Cognitive Load Theory (CLT) into computable LLM metrics that quantify intrinsic, extraneous, and germane cognitive load, enabling intelligent token allocation. Grounded in cognitive economics, CLAI endows models with the ability to autonomously decompose problems; integrated with a neuro-symbolic architecture, it achieves cognitive regulation via two complementary pathways: CLAI-Prompt (a zero-shot meta-prompt) and CLAI-Tune (strategy-oriented fine-tuning). Contribution/Results: Across complex reasoning, long-context question answering, and code generation tasks, CLAI reduces token consumption by up to 45% without sacrificing accuracy, while significantly improving robustness and human-like reasoning.
📝 Abstract
The escalating computational costs of Large Language Model (LLM) inference have become a critical barrier to their widespread and sustainable deployment. While existing optimization strategies are effective, they are predominantly based on statistical heuristics or architectural modifications, lacking a guiding cognitive theory to manage the inference process itself. This paper aims to bridge this gap by introducing a novel paradigm: the Cognitive Load-Aware Inference (CLAI) framework, which operationalizes principles from Cognitive Load Theory (CLT) and neuroscience for LLM inference. We formalize the concepts of Intrinsic Cognitive Load, Extraneous Cognitive Load, and Germane Cognitive Load into quantifiable LLM metrics ($ICL_{LLM}$, $ECL_{LLM}$, and $GCL_{LLM}$), thereby reframing inference as a cognitive-economics optimization problem: given the intrinsic complexity of a problem ($ICL_{LLM}$), minimize wasteful computation ($ECL_{LLM}$) and strategically allocate the token budget to productive reasoning ($GCL_{LLM}$). We propose two implementation paths: CLAI-Prompt, a zero-shot method that guides a base LLM through cognitive control steps via a structured meta-prompt, and CLAI-Tune, a fine-tuned model that internalizes these principles for spontaneous cognitive economy. Across a range of benchmarks in complex reasoning, long-context question answering, and code generation, our methods achieve significant reductions in token consumption (up to 45%) without sacrificing accuracy. Furthermore, CLAI-Tune exhibits an emergent ability to autonomously decompose difficult problems, a key characteristic of human expert cognition. This work demonstrates that by emulating the brain's resource management strategies, we can build more efficient, robust, and capable artificial intelligence systems.
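The cognitive-economics framing in the abstract can be made concrete as a budgeted allocation problem. The display below is an illustrative sketch using the abstract's symbols only; the inference policy $\pi$, the trade-off weight $\lambda$, and the budget function $B(\cdot)$ are assumed notation for exposition, not the paper's stated objective:

$$
\min_{\pi} \;\; ECL_{LLM}(\pi) \;-\; \lambda \, GCL_{LLM}(\pi)
\qquad \text{s.t.} \qquad
ECL_{LLM}(\pi) + GCL_{LLM}(\pi) \;\le\; B\!\left(ICL_{LLM}(x)\right).
$$

Read this as: for a problem $x$, its intrinsic complexity $ICL_{LLM}(x)$ fixes a total token budget $B$; within that budget, the policy should spend as little as possible on extraneous computation ($ECL_{LLM}$) so that the remainder is allocated to germane, productive reasoning ($GCL_{LLM}$), with $\lambda > 0$ weighting the two goals.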