🤖 AI Summary
This work addresses a limitation of existing large language model (LLM)-based sequential recommendation methods, which often overlook visual product information and consequently struggle to distinguish items that are textually similar but visually distinct. To bridge this gap, we propose PixRec, the first vision-language framework that integrates visual context into LLM-based sequential recommendation. PixRec employs a dual-tower architecture to fuse product images with textual attributes and aligns multimodal features through a projection mechanism, jointly modeling user-item and item-item interactions via a hybrid training objective. Experimental results on the image-augmented Amazon Reviews dataset show that PixRec significantly outperforms text-only baselines, achieving a threefold improvement in Top-1 accuracy and a 40% gain in Top-10 accuracy.
📝 Abstract
Large Language Models (LLMs) have recently shown strong potential for sequential recommendation through text-only approaches that combine advanced prompt design, contrastive alignment, and fine-tuning on downstream domain-specific data. While effective, these approaches overlook the rich visual information present in many real-world recommendation scenarios, particularly in e-commerce. This paper proposes PixRec, a vision-language framework that incorporates both textual attributes and product images into the recommendation pipeline. Our architecture leverages a vision-language model backbone capable of jointly processing image-text sequences, maintaining a dual-tower structure and mixed training objective while aligning multi-modal feature projections for both item-item and user-item interactions. On the Amazon Reviews dataset augmented with product images, our experiments demonstrate a $3\times$ improvement in Top-1 accuracy and a 40% improvement in Top-10 accuracy over text-only recommenders, indicating that visual features can help distinguish items with similar textual descriptions. Our work outlines future directions for scaling multi-modal recommender training, enhancing visual-text feature fusion, and evaluating inference-time performance, taking a step toward software systems that exploit visual information in sequential recommendation for real-world applications such as e-commerce.
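To make the dual-tower idea concrete, here is a minimal sketch of how fused image-text item embeddings and a user-history tower could be scored in a shared space. All dimensions, the additive fusion step, the mean-pooled user tower, and the random-projection "encoders" are illustrative assumptions standing in for the learned vision-language backbone; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

class DualTowerSketch:
    """Toy dual-tower recommender: one tower encodes the user's
    interaction history, the other encodes candidate items; each item
    is a fused image+text embedding projected into a shared space."""

    def __init__(self, text_dim=64, image_dim=32, shared_dim=16):
        # Random projections stand in for learned modality encoders.
        self.W_text = rng.normal(size=(text_dim, shared_dim))
        self.W_image = rng.normal(size=(image_dim, shared_dim))

    def encode_item(self, text_feat, image_feat):
        # Project each modality into the shared space and sum
        # (simple additive fusion; a hypothetical choice).
        z = text_feat @ self.W_text + image_feat @ self.W_image
        return l2_normalize(z)

    def encode_user(self, item_embeddings):
        # Mean-pool the item history (stand-in for the LLM tower).
        return l2_normalize(item_embeddings.mean(axis=0))

    def score(self, user_emb, candidate_embs):
        # Cosine similarity ranks candidates for the user.
        return candidate_embs @ user_emb

model = DualTowerSketch()
history = np.stack([model.encode_item(rng.normal(size=64), rng.normal(size=32))
                    for _ in range(5)])
user = model.encode_user(history)
candidates = np.stack([model.encode_item(rng.normal(size=64), rng.normal(size=32))
                       for _ in range(10)])
scores = model.score(user, candidates)
ranking = np.argsort(-scores)  # Top-1 is ranking[0]; Top-10 is ranking[:10]
```

Because both towers emit unit-norm vectors in the same space, Top-1 and Top-10 accuracy reduce to checking whether the held-out next item appears at the head of `ranking`.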