On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices

📅 2025-02-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of deploying diffusion-based text-to-video generation models on resource-constrained mobile devices. To overcome severe computational and memory limitations, it proposes three optimizations: (1) Linear Proportional Leap (LPL) for accelerated sampling, (2) Temporal Dimension Token Merging (TDTM) to compress attention computation, and (3) Concurrent Inference with Dynamic Loading (CI-DL) for memory-controllable on-device inference. Integrated into the Open-Sora framework, the approach combines model lightweighting, attention sparsification, and dynamic chunked loading. It achieves, for the first time, end-to-end, cloud-free, real-time video generation (24 fps at 256×256) on an iPhone 15 Pro. The generated video quality matches that of the GPU-based Open-Sora implementation, while reducing VRAM usage by 72% and inference latency by 68%, significantly enhancing privacy preservation and deployment efficiency.

📝 Abstract

We present On-device Sora, a pioneering solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. Building on Open-Sora, On-device Sora applies three novel techniques to address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and experimental evaluations demonstrate that it can generate high-quality videos on the device, comparable to those produced by Open-Sora running on high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices, expanding accessibility, ensuring user privacy, reducing dependence on cloud infrastructure, and lowering associated costs. We envision On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation capabilities on commodity mobile and embedded devices. The code implementation is publicly available in a GitHub repository: https://github.com/eai-lab/On-device-Sora.
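
To make the TDTM idea concrete, here is a minimal sketch of merging consecutive tokens along the temporal dimension. It is an illustration under stated assumptions, not the authors' implementation: the function names `merge_temporal` and `unmerge_temporal`, the choice of averaging as the merge rule, the merge ratio of 2, and the repeat-based unmerge are all hypothetical.

```python
from typing import List

Token = List[float]          # one feature vector
Frame = List[Token]          # tokens[t][s]: frame t, spatial position s


def merge_temporal(tokens: List[Frame], ratio: int = 2) -> List[Frame]:
    """Average each group of `ratio` consecutive frames, position by position.

    Assumption: averaging is used as the merge rule purely for illustration.
    """
    merged: List[Frame] = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]   # last group may be shorter
        frame: Frame = []
        for s in range(len(group[0])):
            dim = len(group[0][s])
            frame.append(
                [sum(g[s][d] for g in group) / len(group) for d in range(dim)]
            )
        merged.append(frame)
    return merged


def unmerge_temporal(merged: List[Frame], num_frames: int, ratio: int = 2) -> List[Frame]:
    """Restore the original frame count by repeating each merged frame."""
    out: List[Frame] = []
    for frame in merged:
        for _ in range(ratio):
            if len(out) < num_frames:
                out.append([list(tok) for tok in frame])
    return out


# Four frames, one spatial position, two feature channels.
tokens = [[[0.0, 2.0]], [[2.0, 4.0]], [[4.0, 0.0]], [[6.0, 2.0]]]
merged = merge_temporal(tokens)                 # two frames remain
restored = unmerge_temporal(merged, len(tokens))
```

In this sketch an attention layer placed between the merge and unmerge calls would see half as many temporal tokens, which is where the computational saving would come from.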

Problem

Research questions and friction points this paper is trying to address.

Efficient text-to-video generation on mobile devices

Reducing denoising steps with Linear Proportional Leap

Minimizing memory usage via Concurrent Inference with Dynamic Loading
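
The memory side of the problem can be sketched as follows: a model is split into blocks, and the next block is loaded in the background while the current one runs. This is a toy illustration, not the paper's code. `load_block` is a hypothetical stand-in for reading one block's weights from storage, each "block" is a trivial function, and the single-worker prefetch policy is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Block = Callable[[float], float]


def load_block(index: int) -> Block:
    """Hypothetical loader: stands in for reading one block's weights from disk."""
    return lambda x: x + index   # toy "layer" that adds its own index


def run_with_dynamic_loading(num_blocks: int, x: float) -> float:
    """Run blocks in order while prefetching the next one on a background thread."""
    if num_blocks == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, 0)
        for i in range(num_blocks):
            block = pending.result()              # wait for the block needed now
            if i + 1 < num_blocks:
                # Start loading the next block; this overlaps with the compute below.
                pending = pool.submit(load_block, i + 1)
            x = block(x)                          # run the current block
            del block                             # drop the reference so it can be freed
    return x


result = run_with_dynamic_loading(4, 1.0)
```

With this structure at most two blocks (the running one and the prefetched one) are referenced at any time, so peak memory depends on block size rather than on total model size.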

Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear Proportional Leap reduces denoising steps

Temporal Dimension Token Merging minimizes computation

Concurrent Inference with Dynamic Loading manages memory
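
As a rough illustration of how a leap can cut denoising steps, the sketch below runs ordinary Euler steps for the first part of the schedule and then covers all remaining time in one step. It assumes a flow-style sampler in which the model predicts a velocity and time runs from 1 (noise) to 0 (data). The function `sample_with_leap`, the fixed `leap_after` index, and the toy velocity field are hypothetical; this summary does not describe how the paper decides when to leap.

```python
from typing import Callable, Tuple


def sample_with_leap(
    velocity: Callable[[float, float], float],
    x: float,
    num_steps: int,
    leap_after: int,
) -> Tuple[float, int]:
    """Euler-integrate from t=1 to t=0, replacing the tail of the schedule with one leap.

    Returns the final sample and the number of velocity evaluations used.
    """
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 down to 0.0
    evals = 0
    for i in range(num_steps):
        t = times[i]
        # At the leap index, jump straight to t=0 instead of the next grid point.
        t_next = 0.0 if i == leap_after else times[i + 1]
        x = x + (t_next - t) * velocity(x, t)
        evals += 1
        if t_next == 0.0:
            break
    return x, evals


# Toy velocity field whose trajectories are straight lines ending at `target`.
target = 2.0
toy_velocity = lambda x, t: (x - target) / t

x_leap, evals_leap = sample_with_leap(toy_velocity, 5.0, num_steps=30, leap_after=15)
x_full, evals_full = sample_with_leap(toy_velocity, 5.0, num_steps=30, leap_after=30)
```

Because the toy trajectory is exactly linear, the leap reaches the same endpoint with 16 velocity evaluations instead of 30. With a real model the trajectory is only approximately linear, so the quality of the result would depend on when the leap is taken.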

🔎 Similar Papers

Grid Diffusion Models for Text-to-Video Generation