🤖 AI Summary
Deploying large language models (LLMs) on edge devices is hindered by severe compute and memory constraints as well as stringent real-time requirements. To address this, we propose HOLA, an end-to-end optimization framework built around a synergistic mechanism between Hierarchical Speculative Decoding (HSD) and Adaptive Retrieval-Augmented Generation (AdaComp-RAG), integrated with LoRA-based structured pruning, quantization, and LoBi parameter-efficient fusion. HOLA jointly optimizes inference speed and accuracy while supporting dynamic workload adaptation and scalable cross-device deployment. On the GSM8K and ARC benchmarks, HOLA improves EMA by 17.6% and MCA by 10.5% over baselines. On resource-constrained edge platforms, including the Jetson Nano, it substantially reduces latency and memory footprint, demonstrating that low-power, low-latency, high-quality LLM inference at the edge is feasible.
📝 Abstract
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, which poses a barrier to real-time applications in sectors such as healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and retrieval-augmented generation (RAG) offer only partial optimizations and often trade off speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with LoBi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory on edge devices such as the Jetson Nano, proving the approach both scalable and production-ready.
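To make the speculative-decoding idea behind HSD concrete, here is a minimal toy sketch of *plain* speculative decoding (not the paper's hierarchical variant): a cheap draft model proposes `k` tokens, and the expensive target model verifies them in one batched pass, accepting the longest agreeing prefix and substituting its own token at the first mismatch. The `TARGET`/`DRAFT` lookup tables and the greedy accept rule are illustrative stand-ins for real model forward passes, not anything from the paper.

```python
# Toy next-token tables standing in for the target and draft models.
# The draft is "mostly right": it disagrees with the target only at "c".
TARGET = {"a": "b", "b": "c", "c": "d", "d": "e", "e": "."}
DRAFT  = {"a": "b", "b": "c", "c": "x", "d": "e", "e": "."}

def speculative_decode(seq, max_len, k=3):
    """Greedy speculative decoding; returns (text, target_passes)."""
    rounds = 0  # batched target-model passes: the cost we want to minimize
    while len(seq) < max_len and seq[-1] != ".":
        # 1) Draft k tokens cheaply with the small model.
        drafted, cur = [], seq[-1]
        for _ in range(k):
            cur = DRAFT.get(cur, ".")
            drafted.append(cur)
        # 2) Verify all k drafts with one (batched) target pass.
        rounds += 1
        cur = seq[-1]
        for tok in drafted:
            true_tok = TARGET.get(cur, ".")
            if tok == true_tok:
                seq.append(tok)        # draft accepted
                cur = tok
            else:
                seq.append(true_tok)   # target's correction; discard the rest
                break
            if true_tok == ".":
                break                  # end-of-sequence token emitted
    return "".join(seq), rounds

text, rounds = speculative_decode(["a"], max_len=10)
print(text, rounds)  # "abcde." in 2 target passes vs. 5 for plain decoding
```

Because several drafted tokens are usually accepted per verification pass, the number of expensive target-model invocations drops well below one per generated token, which is the latency win HSD exploits (hierarchically, across multiple model tiers) on edge hardware.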