Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency and low throughput in real-time deep recommendation systems, this paper proposes a lightweight online inference framework co-designed across the model and system levels. The method unifies lightweight neural architecture design, structured pruning, and INT8 weight quantization into a single compression paradigm, and pairs it with a real-time load-aware elastic inference scheduling mechanism and a heterogeneous execution engine (built on Triton and TVM) that jointly exploits CPU, GPU, and ASIC resources. With recommendation quality strictly preserved (no measurable degradation in accuracy), the framework reduces end-to-end inference latency by over 70% and increases throughput by 2.1×. It has been deployed in production to support industrial-scale online services at sustained loads exceeding 10 million queries per second (QPS), removing a critical performance bottleneck for deep recommendation models in latency-sensitive scenarios.
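The summary names INT8 weight quantization as one leg of the compression paradigm but does not reproduce the authors' kernels. As a minimal sketch under that assumption, the PyTorch snippet below performs symmetric per-channel INT8 quantization of a linear layer's weights; the function name and layer sizes are hypothetical.

```python
import torch

def quantize_weights_int8(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix.

    Returns the INT8 tensor and per-channel scales, so that
    weight ~= q_weight.float() * scale.
    """
    # One scale per output row; 127 is the largest magnitude INT8 can hold.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q_weight = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q_weight, scale

# Quantize a (hypothetical) ranking-layer weight and check reconstruction error.
layer = torch.nn.Linear(512, 256)
q_w, scale = quantize_weights_int8(layer.weight.data)
reconstructed = q_w.float() * scale
print("max abs error:", (layer.weight.data - reconstructed).abs().max().item())
```

Per-channel scales are the usual choice here because a single per-tensor scale lets one large-magnitude row inflate the error on all the others; a production deployment would additionally rely on INT8 matmul support on the target hardware.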

📝 Abstract
With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Facing massive request volumes and increasingly complex model architectures, real-time recommendation systems must reduce inference latency and increase throughput without sacrificing recommendation quality. This paper addresses the high computational cost and resource bottlenecks of deep learning models in real-time settings with a combined set of model- and system-level acceleration and optimization strategies. At the model level, we sharply reduce parameter counts and compute requirements through lightweight network design, structured pruning, and weight quantization. At the system level, we integrate multiple heterogeneous compute platforms and high-performance inference libraries, and design elastic inference scheduling and load-balancing mechanisms driven by real-time load characteristics. Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput, offering a practical path for deploying large-scale online recommendation services.
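The abstract lists structured pruning among the model-level techniques without specifying the ranking criterion. Purely as an illustrative baseline, not the paper's method, the sketch below zeroes out whole output neurons by L1 norm using PyTorch's built-in torch.nn.utils.prune utilities; the toy model and the 50% ratio are assumptions.

```python
import torch
from torch.nn.utils import prune

# Hypothetical two-layer ranking tower standing in for the paper's model.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

# Zero the 50% of output neurons (rows, dim=0) of the first layer with the
# smallest L1 norm. Note this masks weights; it does not shrink the tensor,
# so a real deployment pipeline would follow with a compaction step.
prune.ln_structured(model[0], name="weight", amount=0.5, n=1, dim=0)
prune.remove(model[0], "weight")  # bake the mask into the weight tensor

kept = (model[0].weight.abs().sum(dim=1) > 0).sum().item()
print(f"non-zero output neurons after pruning: {kept}/128")
```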
Problem

Research questions and friction points this paper is trying to address.

Reduce inference latency in real-time recommendation systems
Increase system throughput without sacrificing recommendation quality
Contain the computational cost and resource bottlenecks of deep learning models in production (see the measurement sketch after this list)
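The paper reports latency and throughput gains, but its measurement harness is not shown here. The sketch below is one plausible way to measure QPS and tail latency for a serving function; the serve stub is a placeholder, not the authors' code.

```python
import time
import statistics

def serve(request):
    """Placeholder for a real model-inference call."""
    time.sleep(0.001)  # simulate ~1 ms of model compute
    return request

def benchmark(n_requests: int = 1000):
    latencies = []
    start = time.perf_counter()
    for i in range(n_requests):
        t0 = time.perf_counter()
        serve(i)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    print(f"QPS: {n_requests / elapsed:,.0f}")
    print(f"p50: {statistics.median(latencies) * 1e3:.2f} ms   "
          f"p99: {p99 * 1e3:.2f} ms")

benchmark()
```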
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight network design reduces parameter counts and compute
Integrated execution across heterogeneous compute platforms (CPU, GPU, ASIC)
Elastic inference scheduling balances load in real time (see the sketch after this list)
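The elastic scheduling mechanism is described only at a high level, so the toy policy below is an assumption rather than the paper's algorithm: it scales a dynamic batch size with queue depth, letting batches grow under heavy load (favoring throughput) and shrink as the queue drains (favoring latency). The bounds and target depth are invented for illustration.

```python
from collections import deque

class ElasticBatcher:
    """Toy load-aware batcher: batch size tracks queue depth within bounds."""

    def __init__(self, min_batch=1, max_batch=64, target_queue=128):
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.target_queue = target_queue
        self.queue = deque()

    def submit(self, request):
        self.queue.append(request)

    def next_batch(self):
        # Scale batch size linearly with load relative to the target depth,
        # then clamp to the configured bounds.
        load = len(self.queue) / self.target_queue
        size = int(self.min_batch + load * (self.max_batch - self.min_batch))
        size = max(self.min_batch, min(self.max_batch, size))
        return [self.queue.popleft() for _ in range(min(size, len(self.queue)))]

# Under a burst of 200 requests, batches start near max_batch and shrink
# as the queue drains.
batcher = ElasticBatcher()
for i in range(200):
    batcher.submit(i)
while batcher.queue:
    print(f"queue={len(batcher.queue):3d} -> batch of {len(batcher.next_batch())}")
```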
👥 Authors
Junli Shao
College of Literature, Science, and the Arts, University of Michigan, Ann Arbor, USA
Jing Dong
Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA
Dingzhou Wang
Pratt School of Engineering, Duke University, Durham, NC, USA
Kowei Shih
Independent Researcher, Shenzhen, China
Dannier Li
School of Computing, University of Nebraska-Lincoln, Lincoln, NE, USA
Chengrui Zhou
Columbia University