🤖 AI Summary
This study systematically investigates the capability of large language models (LLMs) to mitigate popularity bias in third-party library (TPL) recommendation. Addressing the prevalent issue where existing recommenders over-prioritize popular libraries while neglecting long-tail ones, we introduce the first standardized evaluation framework for TPL recommendation, incorporating fine-tuning, post-hoc popularity penalization, and diversity-aware metrics in ablation studies. Results show that state-of-the-art LLMs significantly improve recommendation diversity (e.g., coverage +32%) but fail to fundamentally decouple semantic relevance from statistical popularity—yielding no gain in accuracy and exposing an intrinsic limitation. Our key contributions are: (1) the first LLM-oriented benchmark explicitly designed to evaluate popularity bias in TPL recommendation; (2) empirical identification of the fundamental bottleneck in relevance-popularity disentanglement; and (3) rigorous validation of the limits of data-centric and post-processing strategies, providing foundational evidence for future causal modeling and disentangled learning approaches.
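The "+32% coverage" figure refers to a standard diversity metric: the fraction of the TPL catalog that appears in at least one recommendation list. As a rough illustration only — the function and library names below are hypothetical, not the paper's implementation — it can be computed as:

```python
# Illustrative sketch of catalog coverage, a diversity-aware metric
# of the kind cited in the summary. All names/values are made up.
def catalog_coverage(recommendations, catalog):
    """Fraction of the TPL catalog recommended to at least one user.

    recommendations: list of per-user recommendation lists.
    catalog: iterable of all known libraries.
    """
    recommended = {lib for rec_list in recommendations for lib in rec_list}
    return len(recommended & set(catalog)) / len(catalog)

recs = [["junit", "mockito"], ["junit", "slf4j"]]
catalog = ["junit", "mockito", "slf4j", "gson", "guava"]
print(catalog_coverage(recs, catalog))  # 3 of 5 libraries covered -> 0.6
```

A higher value means the recommender exposes more of the long tail rather than repeating the same popular libraries for every user.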
📝 Abstract
Recommender systems for software engineering (RSSE) play a crucial role in automating development tasks by providing suggestions relevant to the developer's context. However, they suffer from the so-called popularity bias, i.e., the tendency to recommend popular items that may be irrelevant to the task at hand. In particular, the long-tail effect can hamper the system's accuracy, leading to false positives in the provided recommendations. Foundation models, the most advanced generative AI-based models, achieve notable results in several software engineering tasks. This paper investigates the capability of large language models (LLMs) to address the popularity bias in recommender systems for third-party libraries (TPLs). We conduct an ablation study experimenting with state-of-the-art mitigation techniques, including fine-tuning and popularity penalty mechanisms. Our findings reveal that the considered LLMs cannot overcome the popularity bias in TPL recommenders, even though fine-tuning and a post-processing penalty mechanism increase the overall diversity of the provided recommendations. We further discuss the limitations of LLMs in this context and suggest potential improvements for addressing the popularity bias in TPL recommenders, paving the way for additional experiments in this direction.
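The post-processing penalty mechanism mentioned in the abstract can be sketched as a re-ranking step that discounts each candidate's relevance score by its popularity. This is a minimal illustration under assumed names and values (the penalty weight `lam`, the scores, and the library names are all hypothetical), not the paper's actual mechanism:

```python
import math

# Hypothetical post-hoc popularity penalty: subtract a log-popularity
# term from each candidate's relevance score, then re-rank.
def penalized_rerank(candidates, popularity, lam=0.1, k=3):
    """candidates: {library: relevance_score};
    popularity: {library: usage count in the training corpus}.
    Returns the top-k libraries after applying the penalty."""
    adjusted = {
        lib: score - lam * math.log1p(popularity.get(lib, 0))
        for lib, score in candidates.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)[:k]

# A heavily used library can be out-ranked by a slightly less
# relevant long-tail one once the penalty is applied.
print(penalized_rerank({"junit": 0.9, "acme-niche": 0.85},
                       {"junit": 100_000, "acme-niche": 50}, k=2))
```

Note that such a re-ranker only reshuffles the candidate list the model already produced; as the findings suggest, it can raise diversity without making the underlying relevance estimates less popularity-driven.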