Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

📅 2026-04-20

🤖 AI Summary
This work addresses the challenges of deploying large language models on edge devices such as smartphones, where memory, latency, and runtime flexibility are severely constrained. Building on LLaMA, the authors develop a multilingual foundation model with a dynamic task-switching mechanism that requires no recompilation: task-specific LoRA adapters are passed at runtime as inputs to a single frozen inference graph, combined with INT4 quantization and hardware-aware optimizations. The study further introduces multi-stream concurrent decoding, which generates multiple stylistic responses in a single forward pass, and applies Dynamic Self-Speculative Decoding (DS2D), a tree-structured strategy that needs no draft model. Evaluated on Samsung Galaxy S24 and S25 devices, the system achieves 4–6× overall improvements in memory and latency, up to 2.3× faster decoding, and up to 6× lower latency for multi-style generation, while maintaining accuracy across nine languages and eight tasks.
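To make the speculative-decoding idea concrete, here is a minimal sketch of the generic "accept the longest agreeing prefix" verification step. It is not the paper's DS2D: the sketch checks a single linear list of guesses rather than a tree, it does not say where the guesses come from, and `verify_draft` and `toy_model` are hypothetical names. In a real system the checks would run in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List

def verify_draft(context: List[int], draft: List[int],
                 next_token: Callable[[List[int]], int]) -> List[int]:
    """Return the longest draft prefix the model agrees with,
    followed by one token produced by the model itself."""
    accepted: List[int] = []
    for guess in draft:
        target = next_token(context + accepted)
        if guess != target:
            accepted.append(target)  # replace the first wrong guess and stop
            return accepted
        accepted.append(guess)
    # Every guess matched, so the model contributes one extra token.
    accepted.append(next_token(context + accepted))
    return accepted

def toy_model(tokens: List[int]) -> int:
    """Hypothetical stand-in for greedy decoding: a fixed rule on the last token."""
    return (tokens[-1] * 2 + 1) % 10

context = [1]
# The greedy continuation of [1] is 3, 7, 5, 1, ...; the third guess below is wrong.
step = verify_draft(context, [3, 7, 9], toy_model)
print(step)
```

Each verification step emits at least one token, so the output always matches plain greedy decoding; any speedup comes from how many guesses are accepted per step.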

📝 Abstract
Deploying large language models (LLMs) on smartphones poses significant engineering challenges due to stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model supporting multiple use cases on Samsung Galaxy S24 and S25 devices with Qualcomm SM8650 and SM8750 chipsets, respectively. Our approach integrates application-specific LoRAs as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations (such as formal, polite, or jovial responses) within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to 2.3x speedup in decode time. Combined with quantization to INT4 and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate the practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of Generative AI on mobile platforms.
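The idea of feeding LoRAs as runtime inputs to a frozen graph can be illustrated with a small pure-Python sketch. It assumes the standard LoRA form y = Wx + scale · B(Ax); the layer sizes, adapter values, and names (`frozen_layer`, `ADAPTERS`, `run`) are hypothetical and not taken from the paper. The point is only that the function body, standing in for the compiled graph, never changes when the task changes.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]

def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Frozen base weight, fixed when the graph is built (hypothetical 2x3 layer).
W_BASE: Matrix = [[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]]

def frozen_layer(x: Vector, lora_a: Matrix, lora_b: Matrix, scale: float) -> Vector:
    """y = W x + scale * B (A x); A and B arrive as ordinary inputs."""
    base = matvec(W_BASE, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]

# Two hypothetical rank-1 adapters (A is 1x3, B is 2x1), one per task.
ADAPTERS: Dict[str, Tuple[Matrix, Matrix]] = {
    "summarize": ([[0.0, 0.0, 1.0]], [[1.0], [0.0]]),
    "translate": ([[1.0, 1.0, 0.0]], [[0.0], [2.0]]),
}

def run(task: str, x: Vector) -> Vector:
    """Switch tasks by selecting which adapter tensors to pass in."""
    lora_a, lora_b = ADAPTERS[task]
    return frozen_layer(x, lora_a, lora_b, scale=1.0)

x = [1.0, 2.0, 3.0]
out_summarize = run("summarize", x)  # [4.0, 2.0]
out_translate = run("translate", x)  # [1.0, 8.0]
print(out_summarize, out_translate)
```

Because the adapter matrices are arguments rather than constants folded into the weights, switching from `"summarize"` to `"translate"` only changes which small tensors are supplied; the base weights and the computation stay the same.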
Problem

Research questions and friction points this paper is trying to address.

Edge deployment
on-device inference
large language models
memory constraints
latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-LoRA
on-device LLM
multi-stream decoding
Dynamic Self-Speculative Decoding
hardware-aware optimization
Authors
Sravanth Kodavanti
Samsung Research Institute Bangalore, India
Sowmya Vajrala
Samsung Research Institute Bangalore, India
Srinivas Miriyala
Samsung Research Institute Bangalore, India
Utsav Tiwari
Samsung Research Institute Bangalore, India
Uttam Kumar
IIIT Bangalore
Data Mining, Remote Sensing, Digital Image Processing, Spatio-temporal Data Analysis
Utkarsh Kumar Mahawar
Samsung Research Institute Bangalore, India
Achal Pratap Singh
Samsung Research Institute Bangalore, India
Arya D
Samsung Research Institute Bangalore, India
Narendra Mutyala
Samsung Research Institute Bangalore, India
Vikram Nelvoy Rajendiran
Samsung Research Institute Bangalore, India
Sharan Kumar Allur
Samsung Research Institute Bangalore, India
Euntaik Lee
Samsung Electronics, Suwon, South Korea
Dohyoung Kim
Samsung Electronics, Suwon, South Korea
HyeonSu Lee
Samsung Electronics, Suwon, South Korea
Gyusung Cho
Samsung Electronics, Suwon, South Korea
JungBae Kim
Samsung Electronics, Suwon, South Korea