🤖 AI Summary
In multi-turn dialogue, frequent switching between LoRA adapters necessitates recomputation of historical key-value (KV) caches, severely degrading inference efficiency. To address this, we propose Activated LoRA (aLoRA), the first context-aware, on-demand LoRA activation mechanism: it dynamically injects adapter weights only for currently active tokens, fully reusing the base model’s KV cache and decoupling adapter scope from historical token dependencies. aLoRA achieves this via sequence-position-aware weight injection, modularized intrinsic training, and lightweight scheduling—enabling instantaneous invocation of specialized “intrinsic models.” Experiments demonstrate that aLoRA matches standard LoRA’s accuracy while substantially reducing multi-turn switching latency; average inference speed improves by 3.2×. This work establishes a new paradigm for modular, composable customization of large language models.
📝 Abstract
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for fine-tuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multi-turn setting is highly inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), which modifies the LoRA framework to only adapt weights for the tokens in the sequence *after* the aLoRA is invoked. This change crucially allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be instantly activated whenever needed in a chain without recomputing the cache. This enables building what we call *intrinsics*, i.e., highly specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We use aLoRA to train a set of intrinsic models, demonstrating competitive accuracy with standard LoRA while achieving significant inference benefits.
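The core idea above can be sketched in a few lines: the low-rank update is gated by sequence position, so tokens before the invocation point see exactly the base weights (and hence produce a base-model KV cache that can be reused), while tokens after it see the adapted weights. This is a minimal illustrative sketch, not the paper's implementation; the function name `alora_linear`, the argument names, and the omission of details such as the LoRA scaling factor are assumptions.

```python
import numpy as np

def alora_linear(x, W, A, B, invocation_start):
    """Apply one linear layer with an aLoRA-style position-gated update.

    x: (seq_len, d_in) token activations
    W: (d_in, d_out) frozen base weights
    A: (d_in, r) and B: (r, d_out) low-rank adapter factors
    invocation_start: index of the first token after the aLoRA is
        invoked; earlier tokens use only the base weights, so their
        KV cache matches the base model's and need not be recomputed.
    """
    y = x @ W                      # base path, computed for all tokens
    delta = (x @ A) @ B            # low-rank adapter update
    # Gate: 1 for tokens at or after the invocation point, 0 before it
    mask = (np.arange(x.shape[0]) >= invocation_start)[:, None]
    return y + mask * delta
```

In contrast, a standard LoRA applies `delta` to every position, so the cached keys and values for the conversation history differ from the base model's and must be recomputed on each adapter switch.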