Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

📅 2025-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based recommendation assistants are hindered by rigid prompt templates and the absence of realistic, interactive user-query datasets, limiting comprehensive evaluation under scenarios where hard constraints and soft preferences coexist. To address this, we propose RecBench+, the first high-quality, interactive, multi-difficulty benchmark tailored for LLM-era personalized recommendation assistants. It systematically covers core challenges: explicit constraint satisfaction, implicit preference modeling, logical reasoning, and robustness against interference. We introduce a novel LLM-oriented evaluation paradigm that jointly assesses explicit/implicit preference adherence and robustness, incorporating multi-dimensional query generation, difficulty-stratified annotation, and adversarial (misleading) sample design. Empirical results show that state-of-the-art LLMs perform well on explicit queries but exhibit significant degradation in reasoning-intensive and interference-prone settings. The RecBench+ dataset is publicly released.

📝 Abstract
Recommender systems (RecSys) are widely used across modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits comprehensive assessment of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new benchmark dataset designed to assess LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants; 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.
Problem

Research questions and friction points this paper is trying to address.

Traditional recommender systems lack generalization to new interactive tasks.
Existing LLM-based systems rely on fixed templates, limiting comprehensive evaluation.
Commonly used datasets lack realistic textual user queries, making them unsuitable for evaluating LLMs on complex recommendation needs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs enhance interactive personalized recommendation systems
RecBench+ dataset evaluates LLMs on complex user queries
LLMs handle explicit conditions better than reasoning tasks
👥 Authors
Jiani Huang, The Hong Kong Polytechnic University (LLM; Recommender System)
Shijie Wang, The Hong Kong Polytechnic University, HK SAR
Liang-bo Ning, The Hong Kong Polytechnic University, HK SAR
Wenqi Fan, The Hong Kong Polytechnic University, HK SAR
Shuaiqiang Wang, Principal Architect of Search Strategy, Baidu Inc. (Large Language Models; Information Retrieval)
Dawei Yin, Senior Director, Head of Search Science at Baidu (Machine Learning; Web Mining; Data Mining)
Qing Li, The Hong Kong Polytechnic University, HK SAR