Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
High CPU-GPU coordination overhead and unpredictable task execution times severely degrade inference latency in mobile DNN deployment. To address this, we propose a fine-grained heterogeneous collaborative execution framework. Our key contributions are: (1) a lightweight synchronization mechanism leveraging OpenCL shared virtual memory, drastically reducing data migration and synchronization costs; (2) a mobile-GPU-oriented kernel performance model coupled with a machine learning–driven execution time prediction model, enabling high-accuracy dynamic scheduling; and (3) a dynamic parallel partitioning strategy that optimizes computational load distribution across heterogeneous units. Evaluated on four mainstream mobile platforms, our framework achieves 1.89× and 1.75× speedups for linear and convolutional layer inference, respectively—approaching theoretical acceleration limits and effectively overcoming efficiency bottlenecks in mobile heterogeneous computing.

📝 Abstract
Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. At the same time, their unified memory architecture and the narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles to such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism levels). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pixel 5 smartphone).
Problem

Research questions and friction points this paper is trying to address.

Reducing mobile DNN inference latency through CPU-GPU collaboration
Overcoming synchronization overhead in CPU-GPU co-execution systems
Predicting execution times for CPU-GPU task allocation decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight synchronization mechanism using OpenCL SVM
Machine learning models for execution time prediction
Co-execution strategy selection for CPU-GPU acceleration
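The partitioning idea behind co-execution strategy selection can be illustrated with a toy sketch (this is not the paper's implementation; the function name, the per-unit rate inputs, and the grid search over split fractions are illustrative assumptions): given predicted per-unit execution times for each processor, choose the CPU's share of the work so that the slower side's finish time, plus a fixed synchronization cost, is minimized.

```python
def best_split(cpu_rate, gpu_rate, total_work, sync_overhead=0.0, steps=100):
    """Pick the CPU work fraction minimizing co-execution latency.

    cpu_rate / gpu_rate: predicted execution time per unit of work
    (e.g. ms per output row), as a performance model would supply.
    Returns (predicted latency, CPU share).
    """
    best = (float("inf"), 0.0)
    for i in range(steps + 1):
        frac = i / steps                       # fraction of work on the CPU
        t_cpu = cpu_rate * frac * total_work
        t_gpu = gpu_rate * (1 - frac) * total_work
        # Both sides run in parallel; latency is the slower side
        # plus the cost of combining partial results.
        latency = max(t_cpu, t_gpu) + sync_overhead
        if latency < best[0]:
            best = (latency, frac)
    return best
```

For example, if the GPU is predicted to be twice as fast per unit of work (`cpu_rate=2.0`, `gpu_rate=1.0`), the search settles near a one-third CPU share, the point where both sides finish together; with equal rates it settles at an even split.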
Zhuojin Li
University of Southern California, Los Angeles, California, USA
Marco Paolieri
University of Southern California, Los Angeles, California, USA
Leana Golubchik
University of Southern California, Los Angeles, California, USA
Performance Evaluation