A Study on Inference Latency for Vision Transformers on Mobile Devices

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from high inference latency on mobile devices, yet existing work lacks systematic, empirical latency analysis across diverse architectures and platforms. Method: We conduct the first large-scale, real-world benchmarking study—evaluating 190 ViT and 102 CNN models across six mobile platforms using TensorFlow Lite and PyTorch Mobile. To address data scarcity, we propose a synthetic modeling approach to generate a diverse latency dataset comprising 1,000 ViT architectures. Leveraging this dataset, we design a generalizable latency prediction model capable of estimating inference latency for unseen ViT architectures with low error—meeting practical deployment requirements. Contribution/Results: This work introduces the first large-scale, cross-platform, open-source ViT latency dataset. It identifies key architectural factors governing mobile ViT latency and provides a reusable methodology—grounded in empirical evidence—for efficient model selection and deployment on resource-constrained devices.
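The benchmarking methodology described above (real-world latency measurement across platforms) generally amounts to timed repeated inference with warmup. A minimal, framework-agnostic sketch of such a measurement harness is below; the helper name and the warmup/run counts are illustrative assumptions, not the paper's actual protocol, and `dummy_model` stands in for a real interpreter invocation:

```python
import statistics
import time

def measure_latency_ms(run_inference, warmup=10, runs=50):
    """Time a single-inference callable: warm up, then report the
    median of `runs` timed invocations, in milliseconds."""
    for _ in range(warmup):  # warmup runs stabilize caches and clocks
        run_inference()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)  # median is robust to scheduler jitter

# Stand-in for a real model call such as a TFLite interpreter's invoke()
def dummy_model():
    sum(i * i for i in range(10_000))

print(f"median latency: {measure_latency_ms(dummy_model):.3f} ms")
```

Reporting the median rather than the mean is a common choice for on-device timing, since occasional OS scheduling stalls skew averages upward.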

📝 Abstract
Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Quantitatively characterize the performance of vision transformers on mobile devices
Compare the factors driving latency between ViTs and CNNs on mobile devices
Build a dataset for predicting the inference latency of new ViT architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantitatively studied 190 real-world vision transformers on mobile devices
Developed a dataset of measured latencies for 1,000 synthetic ViTs
Predicted inference latency of new ViTs with sufficient accuracy for real-world applications
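The prediction approach above (estimating latency of unseen ViTs from a dataset of measured latencies) can be sketched as feature-based regression. The sketch below is an illustrative assumption, not the paper's actual model: it derives two cost features per architecture (attention and MLP terms), fits linear coefficients on synthetic "measurements," and predicts an unseen architecture's latency.

```python
import numpy as np

rng = np.random.default_rng(0)

def vit_features(depth, embed_dim, num_tokens):
    """Hypothetical per-architecture features: attention cost scales with
    depth * tokens^2 * dim, MLP cost with depth * tokens * dim^2."""
    attn = depth * num_tokens**2 * embed_dim
    mlp = depth * num_tokens * embed_dim**2
    return np.array([attn, mlp, 1.0])  # trailing 1.0 gives an intercept

# Synthetic "measured" latencies: a linear model plus noise, standing in
# for real on-device measurements; coefficients are assumed, not measured.
true_w = np.array([2e-9, 4e-9, 5.0])
archs = [(d, e, t) for d in (6, 12, 24) for e in (192, 384, 768) for t in (196, 576)]
X = np.stack([vit_features(*a) for a in archs])
y = X @ true_w + rng.normal(0, 0.1, len(archs))

# Fit coefficients by least squares on the measured dataset
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict latency of an architecture not in the training set
unseen = vit_features(depth=16, embed_dim=512, num_tokens=196)
print(f"predicted latency: {unseen @ w:.1f} ms")
```

Real predictors would use richer features (operator types, memory traffic, platform-specific kernels), but the structure, architecture features in, measured latencies as targets, is the same.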
Zhuojin Li
University of Southern California, Los Angeles, California, USA
Marco Paolieri
University of Southern California, Los Angeles, California, USA
Leana Golubchik
University of Southern California
Performance Evaluation