Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency and low throughput in real-time deep recommendation systems, this paper proposes a lightweight online inference framework co-designed across the model and system levels. The method unifies lightweight neural architecture design, structured pruning, and INT8 weight quantization into a single compression paradigm, and pairs it with a real-time load-aware elastic inference scheduling mechanism and a heterogeneous execution engine (built on Triton and TVM) that jointly exploits CPU, GPU, and ASIC resources. With recommendation quality strictly preserved (no measurable degradation in accuracy), the framework reduces end-to-end inference latency by over 70% and increases throughput by 2.1×. It has been deployed in production to support industrial-scale online services at sustained loads exceeding 10 million queries per second (QPS), removing a critical performance bottleneck for deep recommendation models in latency-sensitive scenarios.
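The summary names INT8 weight quantization as one leg of the compression paradigm but does not reproduce the authors' kernels. As a minimal sketch under that assumption, the PyTorch snippet below performs symmetric per-channel INT8 quantization of a linear layer's weights; the function name and layer sizes are hypothetical.

```python
import torch

def quantize_weights_int8(weight: torch.Tensor):
    """Symmetric per-output-channel INT8 quantization of a 2-D weight matrix.

    Returns the INT8 tensor and per-channel scales, so that
    weight ~= q_weight.float() * scale.
    """
    # One scale per output row; 127 is the largest magnitude INT8 can hold.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / 127.0
    q_weight = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q_weight, scale

# Quantize a (hypothetical) ranking-layer weight and check reconstruction error.
layer = torch.nn.Linear(512, 256)
q_w, scale = quantize_weights_int8(layer.weight.data)
reconstructed = q_w.float() * scale
print("max abs error:", (layer.weight.data - reconstructed).abs().max().item())
```

Per-channel scales are the usual choice here because a single per-tensor scale lets one large-magnitude row inflate the error on all the others; a production deployment would additionally rely on INT8 matmul support on the target hardware.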

📝 Abstract
With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Facing massive request volumes and increasingly complex model architectures, real-time recommendation systems must reduce inference latency and increase throughput without sacrificing recommendation quality. This paper addresses the high computational cost and resource bottlenecks of deep learning models in real-time settings with a combined set of model- and system-level acceleration and optimization strategies. At the model level, we sharply reduce parameter counts and compute requirements through lightweight network design, structured pruning, and weight quantization. At the system level, we integrate multiple heterogeneous compute platforms and high-performance inference libraries, and design elastic inference scheduling and load-balancing mechanisms driven by real-time load characteristics. Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput, offering a practical path for deploying large-scale online recommendation services.
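The abstract lists structured pruning among the model-level techniques without specifying the ranking criterion. Purely as an illustrative baseline, not the paper's method, the sketch below zeroes out whole output neurons by L1 norm using PyTorch's built-in torch.nn.utils.prune utilities; the toy model and the 50% ratio are assumptions.

```python
import torch
from torch.nn.utils import prune

# Hypothetical two-layer ranking tower standing in for the paper's model.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

# Zero the 50% of output neurons (rows, dim=0) of the first layer with the
# smallest L1 norm. Note this masks weights; it does not shrink the tensor,
# so a real deployment pipeline would follow with a compaction step.
prune.ln_structured(model[0], name="weight", amount=0.5, n=1, dim=0)
prune.remove(model[0], "weight")  # bake the mask into the weight tensor

kept = (model[0].weight.abs().sum(dim=1) > 0).sum().item()
print(f"non-zero output neurons after pruning: {kept}/128")
```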
Problem

Research questions and friction points this paper is trying to address.

Reduce inference latency in real-time recommendation systems
Increase system throughput without sacrificing recommendation quality
Contain the computational cost and resource bottlenecks of deep learning models in production (see the measurement sketch after this list)
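The paper reports latency and throughput gains, but its measurement harness is not shown here. The sketch below is one plausible way to measure QPS and tail latency for a serving function; the serve stub is a placeholder, not the authors' code.

```python
import time
import statistics

def serve(request):
    """Placeholder for a real model-inference call."""
    time.sleep(0.001)  # simulate ~1 ms of model compute
    return request

def benchmark(n_requests: int = 1000):
    latencies = []
    start = time.perf_counter()
    for i in range(n_requests):
        t0 = time.perf_counter()
        serve(i)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    print(f"QPS: {n_requests / elapsed:,.0f}")
    print(f"p50: {statistics.median(latencies) * 1e3:.2f} ms   "
          f"p99: {p99 * 1e3:.2f} ms")

benchmark()
```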
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight network design reduces parameter counts and compute
Integrated execution across heterogeneous compute platforms (CPU, GPU, ASIC)
Elastic inference scheduling balances load in real time (see the sketch after this list)
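The elastic scheduling mechanism is described only at a high level, so the toy policy below is an assumption rather than the paper's algorithm: it scales a dynamic batch size with queue depth, letting batches grow under heavy load (favoring throughput) and shrink as the queue drains (favoring latency). The bounds and target depth are invented for illustration.

```python
from collections import deque

class ElasticBatcher:
    """Toy load-aware batcher: batch size tracks queue depth within bounds."""

    def __init__(self, min_batch=1, max_batch=64, target_queue=128):
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.target_queue = target_queue
        self.queue = deque()

    def submit(self, request):
        self.queue.append(request)

    def next_batch(self):
        # Scale batch size linearly with load relative to the target depth,
        # then clamp to the configured bounds.
        load = len(self.queue) / self.target_queue
        size = int(self.min_batch + load * (self.max_batch - self.min_batch))
        size = max(self.min_batch, min(self.max_batch, size))
        return [self.queue.popleft() for _ in range(min(size, len(self.queue)))]

# Under a burst of 200 requests, batches start near max_batch and shrink
# as the queue drains.
batcher = ElasticBatcher()
for i in range(200):
    batcher.submit(i)
while batcher.queue:
    print(f"queue={len(batcher.queue):3d} -> batch of {len(batcher.next_batch())}")
```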
👥 Authors
Junli Shao
College of Literature, Science, and the Arts, University of Michigan, Ann Arbor, USA
Jing Dong
Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA
Dingzhou Wang
Pratt School of Engineering, Duke University, Durham, NC, USA
Kowei Shih
Independent Researcher, Shenzhen, China
Dannier Li
School of Computing, University of Nebraska-Lincoln, Lincoln, NE, USA
Chengrui Zhou
Columbia University