🤖 AI Summary
This work addresses the challenge of deploying diffusion-based text-to-video generation models on resource-constrained mobile devices. To overcome severe computational and memory limitations, we propose three novel optimizations: (1) Linear Proportional Leap (LPL) for accelerated sampling, (2) Temporal Dimension Token Merging (TDTM) to compress attention computation, and (3) Concurrent Inference with Dynamic Loading (CI-DL) for memory-bounded on-device inference. Integrated into the Open-Sora framework, our approach combines model lightweighting, attention sparsification, and dynamic chunked model loading. We achieve, for the first time, end-to-end, cloud-free, real-time high-definition video generation (24 fps at 256×256) on an iPhone 15 Pro. The generated video quality matches that of the GPU-based Open-Sora implementation, while reducing memory usage by 72% and inference latency by 68%, significantly enhancing privacy preservation and deployment efficiency.
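The CI-DL idea of overlapping block loading with block execution can be illustrated with a minimal thread-based sketch. All names here (`run_pipeline`, `load_block`, `run_block`) are hypothetical; the actual system partitions the diffusion model into blocks and manages device memory, which this toy version does not attempt.

```python
# Sketch of concurrent inference with dynamic loading (CI-DL-style):
# a background thread prefetches the next model block while the
# current block executes, hiding load latency behind compute.
# Function and argument names are illustrative, not from the paper's code.
import threading
import queue

def run_pipeline(block_names, load_block, run_block, x):
    """load_block(name) -> a callable block; run_block(block, x) -> output."""
    loaded = queue.Queue(maxsize=1)  # hold at most one prefetched block

    def loader():
        for name in block_names:
            loaded.put(load_block(name))  # blocks until the consumer takes one

    t = threading.Thread(target=loader, daemon=True)
    t.start()
    for _ in block_names:
        block = loaded.get()      # wait for the next block to be ready
        x = run_block(block, x)   # run it while the loader prefetches ahead
    t.join()
    return x

# Toy usage: each "block" just increments its input.
result = run_pipeline(
    ["block1", "block2", "block3"],
    load_block=lambda name: (lambda v: v + 1),
    run_block=lambda block, v: block(v),
    x=0,
)
print(result)  # 3
```
The bounded queue (`maxsize=1`) keeps at most one block prefetched, which mirrors the memory-bounded goal: only the executing block and the next one need to be resident at a time.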
📝 Abstract
We present On-device Sora, a pioneering solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. Building on Open-Sora, On-device Sora applies three novel techniques to address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenge of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and experimental evaluations demonstrate that it generates high-quality videos on the device, comparable to those produced by Open-Sora running on high-end GPUs. These results show that On-device Sora enables efficient, high-quality video generation on resource-constrained mobile devices, expanding accessibility, ensuring user privacy, reducing dependence on cloud infrastructure, and lowering associated costs. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices. The code is publicly available at a GitHub repository: https://github.com/eai-lab/On-device-Sora.
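The core operation behind TDTM, merging consecutive tokens along the temporal dimension, can be sketched as follows. This is a minimal illustration under our own assumptions (simple averaging over a fixed window, function name `temporal_token_merge`); the actual method merges tokens before attention and restores them afterward, which is omitted here.

```python
# Minimal sketch of temporal-dimension token merging: consecutive
# tokens along the temporal axis are averaged in fixed windows,
# shrinking the token count that attention must process.
import numpy as np

def temporal_token_merge(tokens: np.ndarray, window: int = 2) -> np.ndarray:
    """tokens: (T, S, D) = temporal frames, spatial tokens, channels.
    Returns merged tokens of shape (ceil(T / window), S, D)."""
    T, S, D = tokens.shape
    pad = (-T) % window  # repeat the last frame so T divides evenly
    if pad:
        tokens = np.concatenate([tokens, np.repeat(tokens[-1:], pad, axis=0)], axis=0)
    # Group every `window` consecutive frames and average them.
    return tokens.reshape(-1, window, S, D).mean(axis=1)

# Toy usage: 16 frames of 64 spatial tokens with 8 channels -> 8 merged frames.
x = np.random.rand(16, 64, 8).astype(np.float32)
y = temporal_token_merge(x, window=2)
print(y.shape)  # (8, 64, 8)
```
Halving the temporal token count this way reduces the quadratic attention cost over those tokens, which is the computational saving the abstract attributes to TDTM.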