AI Summary
The effectiveness and applicability boundaries of large language models (LLMs) in recommendation tasks remain poorly understood. Method: We propose a unified prompt engineering framework that reformulates recommendation as natural language inference, enabling zero-shot and cross-scenario generalization. We conduct controlled, multi-dimensional experiments on MovieLens and Amazon datasets to isolate the independent effects of LLM architecture, parameter scale, context length, and four prompt components: task description, user interest modeling, candidate item construction, and prompting strategy. Contribution/Results: Our study establishes a reproducible evaluation paradigm and demonstrates that LLMs possess intrinsic zero-shot recommendation capability. However, prompt quality and fidelity of user interest modeling constitute critical bottlenecks. Structurally optimizing prompts yields substantial performance gains. This work provides both an empirically grounded benchmark and a practical, deployable technical pathway for LLM-based recommender systems.
Abstract
Recently, Large Language Models (LLMs) such as ChatGPT have showcased remarkable abilities in solving general tasks, demonstrating the potential for applications in recommender systems. To assess how effectively LLMs can be used in recommendation tasks, our study primarily focuses on employing LLMs as recommender systems through prompt engineering. We propose a general framework for utilizing LLMs in recommendation tasks, focusing on the capabilities of LLMs as recommenders. To conduct our analysis, we formalize the input of LLMs for recommendation into natural language prompts with two key aspects, and explain how our framework can be generalized to various recommendation scenarios. As for the use of LLMs as recommenders, we analyze the impact of public availability, tuning strategies, model architecture, parameter scale, and context length on recommendation results based on the classification of LLMs. As for prompt engineering, we further analyze the impact of four important components of prompts, i.e., task descriptions, user interest modeling, candidate item construction, and prompting strategies. In each section, we first define and categorize concepts in line with the existing literature. Then, we propose research questions followed by detailed experiments on two public datasets, in order to systematically analyze the impact of different factors on performance. Based on our empirical analysis, we finally summarize promising directions to shed light on future research.
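To make the four prompt components concrete, the following is a minimal sketch of how a recommendation request might be assembled into a single natural language prompt. It is an illustration only, not the paper's implementation: the `RecPrompt` class, its field names, the wording of each section, and the example movie titles are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class RecPrompt:
    """Hypothetical container for the four prompt components named in the abstract."""

    task_description: str    # what the model is asked to do
    user_history: list[str]  # user interest modeling: here, just recent item titles
    candidates: list[str]    # candidate item construction
    strategy_hint: str = ""  # prompting strategy, e.g. an output-format instruction

    def render(self) -> str:
        history = "\n".join(f"- {title}" for title in self.user_history)
        # Number the candidates so the model's answer can refer to them by index.
        options = "\n".join(
            f"{i}. {title}" for i, title in enumerate(self.candidates, start=1)
        )
        parts = [
            self.task_description,
            f"The user recently interacted with:\n{history}",
            f"Candidate items:\n{options}",
        ]
        if self.strategy_hint:
            parts.append(self.strategy_hint)
        return "\n\n".join(parts)


# Illustrative values only; they are not taken from the paper's experiments.
prompt = RecPrompt(
    task_description="Rank the candidate movies by how likely the user is to watch them next.",
    user_history=["Toy Story", "Finding Nemo"],
    candidates=["Up", "Alien", "Monsters, Inc."],
    strategy_hint="Answer with the candidate numbers in ranked order.",
).render()
print(prompt)
```

Keeping each component as a separate field is one way to vary a single factor at a time, for example swapping the user history representation while the task description and candidate list stay fixed.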