Neodragon: Mobile Video Generation using Diffusion Transformer

📅 2025-11-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of existing Transformer-based text-to-video (T2V) models on mobile devices, this work introduces the first lightweight, offline T2V system optimized for mobile hardware, specifically Qualcomm Hexagon NPUs. Methodologically, it integrates four techniques: distilling the 4.762B T5-xxl text encoder into the 0.2B DistilT5, asymmetric distillation of the codec-latent-VAE decoder, structured pruning of MMDiT blocks in the denoiser backbone, and DMD step distillation adapted to pyramidal flow matching to accelerate denoising. Combined with an SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, the complete model comprises only 4.945 billion parameters, peaks at 3.5 GB of memory usage, and generates a 49-frame (2-second) video at 640x1024 resolution in just 6.7 seconds. It achieves a VBench overall score of 81.61, enabling high-fidelity, low-latency, private, and cost-effective on-device video generation.
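The stages above compose into a single on-device pipeline. Below is a minimal sketch of that data flow; all module stubs, tensor shapes, and the 13-frame latent grid are illustrative assumptions, not the released implementation.

```python
# Illustrative data flow of a Neodragon-style mobile T2V pipeline.
# All shapes, stage functions, and hyperparameters are assumptions
# for exposition only.
import torch

def encode_prompt(prompt: str) -> torch.Tensor:
    # Distilled text encoder (DT5, ~0.2B params) replaces T5-xxl (~4.76B).
    return torch.randn(1, 77, 1024)  # (batch, tokens, dim) -- assumed dims

def generate_first_frame(text_emb: torch.Tensor) -> torch.Tensor:
    # SSD1B first-frame image generator, producing a half-resolution latent.
    return torch.randn(1, 16, 40, 64)  # (batch, latent_ch, h/8, w/8)

def denoise_video(text_emb, first_frame_latent, num_steps: int = 4):
    # Pruned MMDiT denoiser; DMD step distillation cuts the NFE to a few
    # pyramidal flow-matching steps instead of dozens of solver steps.
    latents = torch.randn(1, 16, 13, 40, 64)  # (b, c, latent_frames, h, w)
    for _ in range(num_steps):
        # one distilled denoising step would update `latents` here (stub);
        # first_frame_latent would condition the denoiser (unused in stub)
        pass
    return latents

def decode_latents(latents: torch.Tensor) -> torch.Tensor:
    # Distilled asymmetric decoder maps latents to 49 frames at 320x512.
    return torch.rand(1, 49, 3, 320, 512)

def super_resolve(frames: torch.Tensor) -> torch.Tensor:
    # QuickSRNet-style 2x super-resolution yields the final 640x1024 output
    # (bilinear upsampling stands in for the real SR network here).
    b, t, c, h, w = frames.shape
    return torch.nn.functional.interpolate(
        frames.reshape(b * t, c, h, w), scale_factor=2, mode="bilinear"
    ).reshape(b, t, c, h * 2, w * 2)

emb = encode_prompt("a corgi surfing a wave at sunset")
video = super_resolve(decode_latents(denoise_video(emb, generate_first_frame(emb))))
print(video.shape)  # torch.Size([1, 49, 3, 640, 1024])
```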

📝 Abstract
We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @24 fps) videos at 640x1024 resolution directly on a Qualcomm Hexagon NPU in a record 6.7s (7 FPS). Unlike existing transformer-based offline text-to-video generation models, Neodragon is the first specifically optimised for mobile hardware to achieve efficient and high-fidelity video synthesis. We achieve this through four key technical contributions: (1) Replacing the original large 4.762B T5xxl Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabled through a novel Text-Encoder Distillation procedure. (2) Proposing an Asymmetric Decoder Distillation approach that lets us replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the generation pipeline. (3) Pruning MMDiT blocks within the denoiser backbone based on their relative importance, and recovering the original performance through a two-stage distillation process. (4) Reducing the NFE (number of function evaluations) requirement of the denoiser by performing step distillation using DMD adapted for pyramidal flow-matching, thereby substantially accelerating video generation. When paired with an optimised SSD1B first-frame image generator and QuickSRNet for 2x super-resolution, the end-to-end Neodragon system is efficient in parameters (4.945B full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency), making it mobile-friendly while achieving a VBench total score of 81.61. By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website: https://qualcomm-ai-research.github.io/neodragon
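As a concrete illustration of contribution (1), the sketch below shows one way Text-Encoder Distillation could be set up: a small student encoder is trained to match frozen teacher token features through a learned projection. The feature-matching (MSE) objective, the projection layer, and the widths are assumptions for exposition, not the paper's exact recipe.

```python
# Minimal sketch of text-encoder distillation under assumed dimensions.
import torch
import torch.nn as nn

TEACHER_DIM, STUDENT_DIM = 4096, 1024  # assumed T5-xxl vs. DT5 widths

student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=STUDENT_DIM, nhead=8, batch_first=True),
    num_layers=4,
)
proj = nn.Linear(STUDENT_DIM, TEACHER_DIM)  # align student to teacher space
opt = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-4
)

def distill_step(token_embs: torch.Tensor, teacher_out: torch.Tensor) -> float:
    # token_embs:  (b, seq, STUDENT_DIM) student input embeddings
    # teacher_out: (b, seq, TEACHER_DIM) frozen teacher features, same prompts
    student_out = proj(student(token_embs))
    loss = nn.functional.mse_loss(student_out, teacher_out)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy step on random tensors standing in for a real prompt batch.
print(distill_step(torch.randn(2, 77, STUDENT_DIM),
                   torch.randn(2, 77, TEACHER_DIM)))
```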
Problem

Research questions and friction points this paper is trying to address.

Optimizing text-to-video generation for mobile hardware efficiency
Reducing model size and computational demands for on-device synthesis
Achieving high-fidelity video generation without cloud dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distilled the 4.762B T5-xxl text encoder into a 0.2B DistilT5 via Text-Encoder Distillation
Applied Asymmetric Decoder Distillation for efficient decoding without disturbing the generative latent space
Pruned MMDiT blocks by relative importance, with two-stage distillation to recover quality (see the sketch after this list)
Employed DMD step distillation with pyramidal flow matching to reduce denoiser evaluations
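To make the pruning idea concrete, the following sketch scores each transformer block by how much ablating it perturbs the output on calibration data, then keeps the highest-scoring blocks. The scoring metric, keep ratio, and toy block definition are assumptions; the paper's two-stage recovery distillation is omitted here.

```python
# Hedged sketch of importance-based block pruning on a toy block stack.
import torch
import torch.nn as nn

blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(12)
)

@torch.no_grad()
def block_importance(x: torch.Tensor) -> list[float]:
    # Full forward pass serves as the reference output.
    h_full = x
    for blk in blocks:
        h_full = blk(h_full)
    scores = []
    for i in range(len(blocks)):
        h = x
        for j, blk in enumerate(blocks):
            if j != i:  # ablate block i
                h = blk(h)
        # Larger deviation from the reference => block i matters more.
        scores.append((h_full - h).abs().mean().item())
    return scores

calib = torch.randn(2, 64, 256)  # stand-in for real calibration activations
scores = block_importance(calib)
keep = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)[:8]
pruned = nn.ModuleList(blocks[i] for i in sorted(keep))
print(f"kept blocks: {sorted(keep)}")
```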
Authors

Animesh Karnewar
Denis Korzhenkov (Qualcomm AI Research)
Ioannis Lelekas
Adil Karjauv (Machine Learning R&D, Qualcomm)
N. Fathima
Hanwen Xiong
Vancheeswaran Vaidyanathan
Will Zeng
Rafael Esteves
Tushar Singhal
F. Porikli
Mohsen Ghafoorian (Sr. Staff Computer Vision Research Scientist, Qualcomm)
A. Habibian