Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenges of deploying large language models on edge devices such as smartphones, where memory, latency, and runtime flexibility are severely constrained. Building on LLaMA, the authors develop a multilingual foundation model with a dynamic task-switching mechanism that requires no recompilation: task-specific LoRA adapters are passed at runtime as inputs to a single frozen inference graph, combined with INT4 quantization and hardware-aware optimizations. The study further introduces multi-stream concurrent decoding, which generates multiple stylistic responses in a single forward pass, and applies Dynamic Self-Speculative Decoding (DS2D), a tree-structured strategy that needs no draft model. Evaluated on Samsung Galaxy S24 and S25 devices, the system achieves 4–6× overall improvements in memory usage and latency, up to 2.3× faster decoding, and up to 6× lower latency for multi-style generation, while maintaining accuracy across nine languages and eight tasks.
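To make the INT4 quantization mentioned in the summary concrete, here is a minimal sketch of symmetric per-tensor 4-bit quantization. It is an illustration under stated assumptions, not the authors' scheme: the summary does not say whether quantization is symmetric, per-channel, or group-wise, and the function names `quantize_int4` and `dequantize_int4` are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in the signed 4-bit range [-8, 7].

    Assumption: symmetric, per-tensor scaling (one scale for all weights).
    """
    max_abs = max(abs(w) for w in weights)
    # Largest magnitude maps to 7; guard against an all-zero tensor.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int4(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from 4-bit codes and the scale."""
    return [q * scale for q in quantized]


weights = [3.5, -1.0, 0.0, 0.5]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
```

Each weight is stored in 4 bits instead of 16 or 32, which is where the memory saving comes from; the cost is a rounding error of at most half the scale per weight for values inside the representable range.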

📝 Abstract

Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with Qualcomm SM8650 and SM8750 chipsets, respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations (such as formal, polite, or jovial responses) within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate the practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of Generative AI on mobile platforms.
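The idea of passing LoRAs as runtime inputs to a frozen graph can be sketched with a single linear layer. This is a toy illustration in pure Python, not the paper's implementation: the names `lora_linear`, `ADAPTERS`, and `run`, the task labels, and all weight values are hypothetical. The point it shows is that the base weight `W` never changes, while the low-rank factors `A` and `B` arrive as ordinary arguments, so switching tasks only means supplying different inputs.

```python
from typing import Dict, List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_linear(x: Vector, w_frozen: Matrix,
                lora_a: Matrix, lora_b: Matrix, scale: float) -> Vector:
    """Compute y = W x + scale * B (A x).

    W is the frozen base weight; A and B are per-call inputs,
    so the same function serves every task.
    """
    base = matvec(w_frozen, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, low_rank)]


# Frozen 2x2 base weight, fixed once (stands in for the compiled graph).
W: Matrix = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical rank-1 adapters, one per task.
ADAPTERS: Dict[str, Dict[str, Matrix]] = {
    "summarize": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},
    "translate": {"A": [[1.0, -1.0]], "B": [[0.0], [2.0]]},
}


def run(task: str, x: Vector, scale: float = 1.0) -> Vector:
    """Select an adapter by task name and feed it in as a runtime input."""
    adapter = ADAPTERS[task]
    return lora_linear(x, W, adapter["A"], adapter["B"], scale)


summary_out = run("summarize", [1.0, 2.0])
translate_out = run("translate", [1.0, 2.0])
```

Because the adapter matrices are small relative to `W`, keeping several of them resident and swapping which one is fed in is far cheaper than compiling or loading a separate model per task.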

Problem

Research questions and friction points this paper is trying to address.

Edge deployment

on-device inference

large language models

memory constraints

latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-LoRA

on-device LLM

multi-stream decoding