🤖 AI Summary
Deploying large language models (LLMs) on edge devices is hindered by severe compute and memory constraints as well as stringent real-time requirements. To address this, we propose HOLA, an end-to-end optimization framework built around a synergistic mechanism between Hierarchical Speculative Decoding (HSD) and Adaptive Retrieval-Augmented Generation (AdaComp-RAG), integrated with LoRA-based structured pruning, quantization, and LoBi parameter-efficient fusion. HOLA jointly optimizes inference speed and accuracy while supporting dynamic workload adaptation and scalable cross-device deployment. On the GSM8K and ARC benchmarks, HOLA improves EMA by 17.6% and MCA by 10.5% over baselines. On resource-constrained edge platforms, including the Jetson Nano, it substantially reduces latency and memory footprint, demonstrating that low-power, low-latency, high-quality LLM inference at the edge is feasible.
📝 Abstract
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, which poses a barrier to real-time applications in sectors such as healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often trade off speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory on edge devices such as the Jetson Nano, proving the approach both scalable and production-ready.
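To make the speculative-decoding idea behind HSD concrete, here is a minimal toy sketch of *plain* speculative decoding (not the paper's hierarchical variant): a cheap draft model proposes `k` tokens, and the expensive target model verifies them in one batched pass, accepting the longest agreeing prefix and substituting its own token at the first mismatch. The `TARGET`/`DRAFT` lookup tables and the greedy accept rule are illustrative stand-ins for real model forward passes, not anything from the paper.

```python
# Toy next-token tables standing in for the target and draft models.
# The draft is "mostly right": it disagrees with the target only at "c".
TARGET = {"a": "b", "b": "c", "c": "d", "d": "e", "e": "."}
DRAFT  = {"a": "b", "b": "c", "c": "x", "d": "e", "e": "."}

def speculative_decode(seq, max_len, k=3):
    """Greedy speculative decoding; returns (text, target_passes)."""
    rounds = 0  # batched target-model passes: the cost we want to minimize
    while len(seq) < max_len and seq[-1] != ".":
        # 1) Draft k tokens cheaply with the small model.
        drafted, cur = [], seq[-1]
        for _ in range(k):
            cur = DRAFT.get(cur, ".")
            drafted.append(cur)
        # 2) Verify all k drafts with one (batched) target pass.
        rounds += 1
        cur = seq[-1]
        for tok in drafted:
            true_tok = TARGET.get(cur, ".")
            if tok == true_tok:
                seq.append(tok)        # draft accepted
                cur = tok
            else:
                seq.append(true_tok)   # target's correction; discard the rest
                break
            if true_tok == ".":
                break                  # end-of-sequence token emitted
    return "".join(seq), rounds

text, rounds = speculative_decode(["a"], max_len=10)
print(text, rounds)  # "abcde." in 2 target passes vs. 5 for plain decoding
```

Because several drafted tokens are usually accepted per verification pass, the number of expensive target-model invocations drops well below one per generated token, which is the latency win HSD exploits (hierarchically, across multiple model tiers) on edge hardware.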