🤖 AI Summary
Existing LLM-based Top-k recommendation models lack systematic, component-level evaluation frameworks, leading to inflated performance estimates (particularly due to prompt leakage in pointwise methods) and poor cross-dataset reproducibility.
Method: We propose RecRankerEval, a five-dimensional evaluation framework for RecRanker, systematically analyzing contributions of user sampling, initial ranking models, LLM backbones, datasets, and instruction tuning. We introduce cluster-based user sampling, multi-source initial ranking generation, and hybrid instruction tuning to enable flexible LLM–recommendation module collaboration.
Contribution/Results: RecRankerEval is the first component-attribution framework for LLM-based Top-k recommendation, exposing prompt leakage as a key source of evaluation bias and enabling fairer, more generalizable evaluation. It achieves reproducible, state-of-the-art performance on ML-100K, ML-1M, and Amazon-Music, surpassing the original RecRanker, while establishing a benchmarked, extensible paradigm for rigorous, reproducible component-level analysis.
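The cluster-based user sampling mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes user embeddings are available as a NumPy array, uses a bare-bones Lloyd's k-means, and the function names (`kmeans_labels`, `cluster_sample_users`) are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means: assign each row of X to one of k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Distance from every point to every center, then nearest-center labels.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)  # move center to cluster mean
    return labels

def cluster_sample_users(user_emb, k=5, per_cluster=3, seed=0):
    """Draw up to `per_cluster` users from each embedding cluster so the
    sampled set covers diverse preference profiles (hypothetical sketch)."""
    labels = kmeans_labels(user_emb, k, seed=seed)
    rng = np.random.default_rng(seed)
    sampled = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if len(members):
            take = min(per_cluster, len(members))
            sampled.extend(rng.choice(members, size=take, replace=False).tolist())
    return sorted(sampled)

# Toy example: 100 users with 16-dimensional (synthetic) embeddings.
emb = np.random.default_rng(1).normal(size=(100, 16))
users = cluster_sample_users(emb)
```

Compared with uniform random sampling, this stratified draw keeps small but distinct user groups represented in the instruction-tuning data.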
📝 Abstract
A recent large language model (LLM)-based recommendation model, RecRanker, has demonstrated superior performance on the top-k recommendation task compared to other models. In particular, RecRanker samples users via clustering, generates an initial ranking list using an initial recommendation model, and fine-tunes an LLM through hybrid instruction tuning to infer user preferences. However, the contribution of each core component remains underexplored. In this work, we inspect the reproducibility of RecRanker and study the impact and role of its various components. We begin by reproducing the RecRanker pipeline through the implementation of all its key components. Our reproduction shows that the pairwise and listwise methods achieve performance comparable to that reported in the original paper. For the pointwise method, while we are also able to reproduce the original paper's results, further analysis shows that its performance is abnormally high due to data leakage, namely the inclusion of ground-truth information in the prompts. To enable a fair and comprehensive evaluation of LLM-based top-k recommendation, we propose RecRankerEval, an extensible framework that covers five key dimensions: user sampling strategy, initial recommendation model, LLM backbone, dataset selection, and instruction tuning method. Using the RecRankerEval framework, we show that the original results of RecRanker can be reproduced on the ML-100K and ML-1M datasets, as well as the additional Amazon-Music dataset, but not on BookCrossing, due to the lack of timestamp information in that dataset. Furthermore, we demonstrate that RecRanker's performance can be improved by employing alternative user sampling methods, stronger initial recommenders, and more capable LLMs.
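The pointwise leakage issue described above can be made concrete with a small sketch. This is a hypothetical prompt builder, not the paper's code: the function name, prompt wording, and item titles are illustrative. The point is that if the held-out ground-truth rating is interpolated into the prompt, the model is handed the very label it is asked to predict.

```python
def pointwise_prompt(user_history, candidate, true_rating=None):
    """Build a pointwise rating prompt (illustrative sketch).

    Passing the held-out `true_rating` produces a leaky prompt that
    contains the label to be predicted; a fair prompt omits it.
    """
    lines = [
        "The user previously liked: " + ", ".join(user_history) + ".",
        f"Predict the user's rating (1-5) for: {candidate}.",
    ]
    if true_rating is not None:
        # LEAKY variant: the ground-truth label appears in the prompt text.
        lines.insert(1, f"The user's rating for {candidate} is {true_rating}.")
    return "\n".join(lines)

leaky = pointwise_prompt(["Toy Story", "Heat"], "Casino", true_rating=5)
fair = pointwise_prompt(["Toy Story", "Heat"], "Casino")
```

Any evaluation that scores the leaky variant will look abnormally strong, since the model can simply echo the rating already present in its input; auditing prompts for such label strings is a cheap sanity check.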