A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems

📅 2024-06-25
🏛️ ACM Transactions on Information Systems
📈 Citations: 0
Influential: 0
🤖 AI Summary
Lightweight embedding-based recommender systems (LERSs) suffer from a lack of standardized evaluation protocols, unclear cross-task generalization capabilities, and insufficient empirical validation for edge deployment. Method: This paper introduces the first task-agnostic, hardware-software co-designed comprehensive evaluation framework for LERSs. We propose a magnitude-based pruning baseline for universal embedding compression and systematically assess performance, efficiency, and transferability across collaborative filtering and content-based recommendation tasks. We further conduct the first end-to-end latency profiling of LERS inference on a Raspberry Pi 4. Contributions: (1) We uncover significant task-specific preferences in LERS design; (2) Our baseline substantially outperforms multiple state-of-the-art LERS methods despite its simplicity; (3) We identify critical CPU inference bottlenecks in resource-constrained settings; and (4) We fully open-source all code, models, and documentation to foster reproducible lightweight recommendation research.

📝 Abstract
Since the creation of the Web, recommender systems (RSs) have been an indispensable personalization mechanism in information filtering. Most state-of-the-art RSs primarily depend on categorical features such as user and item IDs, and use embedding vectors to encode their information for accurate recommendations, resulting in an excessively large embedding table owing to the immense feature corpus. To prevent the heavily parameterized embedding table from harming RSs' scalability, both academia and industry have seen increasing efforts to compress RS embeddings, a trend further amplified by the recent uptake of edge computing for online services. However, despite the prosperity of existing lightweight embedding-based RSs (LERSs), a strong diversity is seen in the evaluation protocols adopted across publications, making it difficult to relate the reported performance of those LERSs to their real-world usability. On the other hand, although the two fundamental recommendation tasks, namely traditional collaborative filtering and content-based recommendation, share the common goal of achieving lightweight embeddings, existing LERSs are designed and evaluated with a straightforward "either-or" choice between the two. Consequently, the lack of discussion of a method's cross-task transferability will likely hinder the development of unified, more scalable solutions for production environments. Motivated by these unresolved issues, this study systematically investigates existing LERSs' performance, efficiency, and cross-task transferability via a thorough benchmarking process. To create a generic, task-independent baseline, we propose an efficient embedding compression approach based on magnitude pruning, which proves to be an easy-to-deploy yet highly competitive baseline that outperforms various complex LERSs.
Our study reveals the distinct performance of different LERSs across the two recommendation tasks, shedding light on their effectiveness and generalizability under different settings. Furthermore, to account for edge-based recommendation, an increasingly popular use case of LERSs, we have also deployed and tested all LERSs on a Raspberry Pi 4, where their efficiency bottleneck is exposed compared with GPU-based deployment. Finally, we conclude with critical summaries of the performance comparison, suggestions on model selection based on task objectives, and underexplored challenges around the applicability of existing LERSs for future research. To encourage and support future LERS research, we publish all source code, data, checkpoints, and documentation at https://github.com/chenxing1999/recsys-benchmark.
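The magnitude-pruning baseline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the global (whole-table) threshold and the `sparsity` parameter are assumptions made here for clarity.

```python
import numpy as np

def magnitude_prune(embedding: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries of an embedding table.

    Sketch of a magnitude-based pruning baseline: entries whose absolute
    value falls below a global threshold are set to zero, so roughly
    `sparsity` fraction of the table becomes zero (storable sparsely).
    """
    flat = np.abs(embedding).ravel()
    k = int(sparsity * flat.size)              # number of entries to prune
    if k == 0:
        return embedding.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    # Keep entries strictly above the threshold; zero out the rest.
    return np.where(np.abs(embedding) <= threshold, 0.0, embedding)

# Toy example: a 4-user x 3-dimension embedding table pruned to 50% sparsity
rng = np.random.default_rng(0)
table = rng.normal(size=(4, 3))
pruned = magnitude_prune(table, sparsity=0.5)
```

In practice the surviving nonzero weights would be fine-tuned (or the pruning applied iteratively during training), but even this one-shot variant conveys why the baseline is easy to deploy: it needs no architectural changes to the recommender.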
Problem

Research questions and friction points this paper is trying to address.

Lightweight Recommendation Systems
Complexity and Bloat
Cross-task Performance Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Magnitude Pruning
Lightweight Recommendation System
Raspberry Pi 4 Efficiency
Hung Vinh Tran
Department of Mathematics, University of Wisconsin Madison
PDE
Tong Chen
The University of Queensland, Australia
Q. Nguyen
Griffith University, Australia
Zi-Liang Huang
The University of Queensland, Australia
Li-zhen Cui
Shandong University, China
Hongzhi Yin
Professor and ARC Future Fellow, University of Queensland
Recommender System · Graph Learning · Spatial-temporal Prediction · Edge Intelligence · LLM