Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices

📅 2024-10-04
📈 Citations: 4
Influential: 1
🤖 AI Summary
Deploying lightweight large language models (LLMs) such as Gemini Nano and LLaMA2-7B on commercial smartphones for privacy-sensitive, on-device inference remains challenging due to hardware-system bottlenecks under real-world constraints. Method: We conduct a systematic, multi-dimensional empirical evaluation across user-centric metrics (token throughput, time-to-first-token, power consumption), system resource utilization (memory bandwidth, GPU/NPU occupancy), and hardware-level controls (DVFS policies), benchmarking mainstream inference engines—including llama.cpp and MLC-LLM—on diverse mobile SoCs (Snapdragon, Dimensity, Apple A-series). Contribution/Results: This work is the first to identify and characterize hardware-system co-bottlenecks induced by LLM workloads on modern mobile SoCs, establishing memory bandwidth and energy efficiency as the primary limiting factors. Our analysis provides empirically grounded insights and actionable optimization pathways for on-device model compression, inference engine design, and AI-accelerator architecture development.

📝 Abstract
As large language models (LLMs) increasingly integrate into every aspect of our work and daily lives, there are growing concerns about user privacy, which push the trend toward local deployment of these models. A number of lightweight LLMs (e.g., Gemini Nano, LLaMA2-7B) can run locally on smartphones, providing users with greater control over their personal data. As this is a rapidly emerging application, we are concerned with its performance on commercial off-the-shelf (COTS) mobile devices. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. We evaluate both metrics that affect user experience, including token throughput, latency, and battery consumption, and factors critical to developers, such as resource utilization, DVFS strategies, and inference engines. In addition, we provide a detailed analysis of how these hardware capabilities and system dynamics affect on-device LLM performance, which may help developers identify and address bottlenecks in mobile LLM applications. We also provide comprehensive comparisons across mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads. We hope that this study can provide insights both for the development of on-device LLMs and for the design of future mobile system architectures.
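The user-facing metrics the study evaluates, time-to-first-token (TTFT, dominated by the prefill phase) and token throughput (the decode rate after the first token), can be sketched as follows. This is a minimal illustration, not the paper's benchmarking harness: `generate_tokens` is a hypothetical stand-in for any streaming inference API, such as a llama.cpp or MLC-LLM binding that yields tokens one at a time.

```python
import time

def measure_llm_metrics(generate_tokens, prompt):
    """Measure time-to-first-token (TTFT) and decode throughput.

    `generate_tokens(prompt)` is assumed to be a streaming generator
    that yields one decoded token per iteration (hypothetical API).
    """
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in generate_tokens(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now  # prefill done; first token emitted
        count += 1
    end = time.perf_counter()

    ttft = first_token_time - start
    # Throughput over the decode phase only (tokens after the first)
    decode_time = end - first_token_time
    throughput = (count - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, throughput
```

Separating TTFT from decode throughput matters because the two phases stress the SoC differently: prefill is compute-bound, while decode is typically limited by memory bandwidth, which the summary above identifies as a primary bottleneck.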
Problem

Research questions and friction points this paper is trying to address.

How well do lightweight LLMs (e.g., Gemini Nano, LLaMA2-7B) perform on commercial off-the-shelf mobile devices?
Which user-experience metrics (token throughput, latency, battery consumption) and developer-facing factors (resource utilization, DVFS strategies, inference engines) limit on-device inference?
How do mobile SoCs from major vendors differ in handling LLM workloads?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic measurement of both user-experience and system-level metrics for LLM inference on COTS smartphones
Analysis of how hardware capabilities and system dynamics (e.g., memory bandwidth, DVFS policies) shape on-device LLM performance
Cross-vendor comparison of mobile SoCs (Snapdragon, Dimensity, Apple A-series) under LLM workloads