🤖 AI Summary
Existing LLM-based recommender systems use negative samples inefficiently: naively aggregating large numbers of negatives improves ranking accuracy and mitigates popularity bias but incurs substantial computational and memory overhead, and treating all negatives as equally informative limits optimization efficacy. This paper proposes an efficient preference optimization framework with two core innovations: (1) in-batch negative sample sharing, which scales up the effective number of negatives without a proportional cost increase; and (2) dynamic reward margin adjustment, which weights samples by informativeness to guide more effective learning. The method unifies preference optimization, contrastive learning, and dynamic-margin reinforcement learning. Evaluated on three public benchmarks, it significantly outperforms state-of-the-art approaches, achieving higher recommendation accuracy while more effectively suppressing popularity bias.
📝 Abstract
Recommendation systems leverage user interaction data to suggest relevant items while filtering out irrelevant (negative) ones. The rise of large language models (LLMs) has garnered increasing attention for their potential in recommendation tasks. However, existing methods for optimizing LLM-based recommenders face challenges in effectively utilizing negative samples. Simply integrating large numbers of negative samples can improve ranking accuracy and mitigate popularity bias but often leads to increased computational overhead and memory costs. Additionally, current approaches fail to account for the varying informativeness of negative samples, leading to suboptimal optimization performance. To address these issues, we propose NAPO (**N**egative-**A**ware **P**reference **O**ptimization), an enhanced framework for preference optimization in LLM-based recommendation. NAPO introduces two key innovations: (1) in-batch negative sharing, which expands the pool of negative samples without additional memory overhead, and (2) dynamic reward margin adjustment, which adapts model updates based on the confidence of negative samples. Extensive experiments on three public datasets demonstrate that NAPO outperforms existing methods in both recommendation accuracy and popularity bias reduction.
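To make the two innovations concrete, here is a minimal sketch of a DPO-style pairwise preference loss that (1) reuses every negative in the batch for every positive and (2) scales the reward margin by the model's confidence in each negative. This is a hypothetical formulation for illustration only; the function name `napo_loss`, the margin function, and the hyperparameters `beta` and `gamma` are assumptions, not the paper's exact definitions.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def napo_loss(pos_logps, neg_logps, beta=1.0, gamma=1.0):
    """Illustrative DPO-style loss with in-batch negative sharing and a
    dynamic reward margin (assumed formulation, not the paper's exact one).

    pos_logps: per-user log-probabilities of each positive item
    neg_logps: per-user log-probabilities of each sampled negative item
    """
    total, count = 0.0, 0
    for pos in pos_logps:
        # In-batch sharing: every user's negative is reused as a negative
        # for every positive in the batch, so B sampled negatives yield
        # B*B preference pairs with no extra forward passes.
        for neg in neg_logps:
            # Dynamic margin: a "confident" negative (higher log-prob under
            # the model) gets a larger margin, so informative negatives
            # push the update harder than easy, low-probability ones.
            margin = gamma * sigmoid(neg)
            logit = beta * (pos - neg) - margin
            total += -math.log(sigmoid(logit))  # standard logistic loss
            count += 1
    return total / count
```

In a real implementation the double loop would be a single broadcasted tensor operation, which is what makes the extra negatives essentially free in memory and compute.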