Extending MovieLens-32M to Provide New Evaluation Objectives

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional offline evaluation of recommender systems rewards predicting the held-out items a user has already rated highly, which is misaligned with the real task of helping users find movies they would enjoy watching. Method: This paper adopts watch intent—predicting a user's watchlist—as the evaluation target. The authors extend the MovieLens-32M dataset by recruiting MovieLens users, collecting their profiles, generating recommendations with a diverse set of twenty-two algorithmic runs, pooling the recommendations, and having the users assess the pools for watch interest. Contribution/Results: Under traditional rating-prediction evaluation, the Popular baseline (ranking movies by total number of ratings) places mid-pack among the twenty-two runs; under watch-intent assessment it falls to among the worst, suggesting that direct user assessment alleviates the popularity bias induced by information-retrieval-style effectiveness measures. The extension is publicly released as a downloadable resource, providing a more realistic, behaviorally grounded evaluation benchmark for recommender systems.

📝 Abstract
Offline evaluation of recommender systems has traditionally treated the problem as a machine learning problem. In the classic case of recommending movies, where the user has provided explicit ratings of which movies they like and don't like, each user's ratings are split into test and train sets, and the evaluation task becomes to predict the held out test data using the training data. This machine learning style of evaluation makes the objective to recommend the movies that a user has watched and rated highly, which is not the same task as helping the user find movies that they would enjoy if they watched them. This mismatch in objective between evaluation and task is a compromise to avoid the cost of asking a user to evaluate recommendations by watching each movie. As a resource available for download, we offer an extension to the MovieLens-32M dataset that provides for new evaluation objectives. Our primary objective is to predict the movies that a user would be interested in watching, i.e. predict their watchlist. To construct this extension, we recruited MovieLens users, collected their profiles, made recommendations with a diverse set of algorithms, pooled the recommendations, and had the users assess the pools. Notably, we found that the traditional machine learning style of evaluation ranks the Popular algorithm, which recommends movies based on total number of ratings in the system, in the middle of the twenty-two recommendation runs we used to build the pools. In contrast, when we rank the runs by users' interest in watching movies, we find that recommending popular movies as a recommendation algorithm becomes one of the worst performing runs. It appears that by asking users to assess their personal recommendations, we can alleviate the popularity bias issues created by using information retrieval effectiveness measures for the evaluation of recommender systems.
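The abstract describes two mechanisms worth making concrete: the Popular baseline, which ranks movies by their total number of ratings, and the pooling step, where top recommendations from many runs are unioned for users to assess. A minimal sketch of both, assuming a simple list of `(user, movie, rating)` tuples (the function names and data shapes here are illustrative, not from the paper's released code):

```python
from collections import Counter

def popular_run(ratings, k=10):
    """Sketch of the 'Popular' baseline: rank movies by total number
    of ratings in the system, regardless of rating value."""
    counts = Counter(movie for _, movie, _ in ratings)
    return [movie for movie, _ in counts.most_common(k)]

def pool_runs(runs, depth=10):
    """Sketch of pooling: take the union of each run's top-`depth`
    movies; users then assess every movie in the resulting pool."""
    pooled = set()
    for run in runs:
        pooled.update(run[:depth])
    return pooled

# Illustrative usage with toy data
ratings = [
    (1, "A", 5), (2, "A", 4), (3, "A", 2),
    (1, "B", 5), (2, "B", 3),
    (1, "C", 1),
]
top = popular_run(ratings, k=2)        # "A" has the most ratings
pool = pool_runs([["A", "B"], ["B", "C"]], depth=2)
```

This makes the paper's core observation easy to state: `popular_run` ignores rating values entirely, so an evaluation that ranks it mid-pack is measuring exposure as much as preference.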
Problem

Research questions and friction points this paper is trying to address.

Extend MovieLens-32M dataset for new evaluation objectives
Predict user watchlist instead of highly-rated movies
Address popularity bias in recommender system evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends MovieLens-32M dataset for new objectives
Predicts user watchlist instead of rated movies
Reduces popularity bias via user-assessed recommendations