🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models (LLMs) in recommender systems beyond accuracy, particularly with respect to diversity, novelty, and popularity bias. It presents a comprehensive assessment of ChatGPT-3.5 and ChatGPT-4 in Top-N recommendation and cold-start scenarios, using three datasets and combining traditional accuracy metrics with beyond-accuracy dimensions. The results indicate that ChatGPT-4 matches or surpasses conventional recommender models while balancing diversity and novelty, and that the ChatGPT models achieve superior accuracy and novelty in cold-start settings. The analysis also examines how ChatGPT's recommendations interact with popularity bias, offering empirical insight into the strengths and limitations of LLMs for recommendation tasks.
📝 Abstract
ChatGPT has emerged as a versatile tool, demonstrating capabilities across diverse domains. Given these successes, the Recommender Systems (RSs) community has begun investigating its application to recommendation scenarios, focusing primarily on accuracy. Although the integration of ChatGPT into RSs has attracted significant attention, its performance along other dimensions remains largely unexplored. In particular, its ability to provide diverse and novel recommendations, and its susceptibility to biases such as popularity bias, have not been thoroughly examined. As the use of these models continues to expand, understanding these aspects is crucial for enhancing user satisfaction and achieving long-term personalization. This study assesses the recommendations provided by ChatGPT-3.5 and ChatGPT-4 in terms of diversity, novelty, and popularity bias. We evaluate both models on three distinct datasets, in Top-N recommendation and cold-start scenarios. The findings reveal that ChatGPT-4 matches or surpasses traditional recommenders, demonstrating the ability to balance novelty and diversity in its recommendations. Furthermore, in the cold-start scenario, the ChatGPT models exhibit superior performance in both accuracy and novelty, suggesting that they can be particularly beneficial for new users. This research highlights the strengths and limitations of ChatGPT's recommendations, offering new perspectives on the capacity of these models to provide recommendations beyond accuracy-focused metrics.