PackageIntel: Leveraging Large Language Models for Automated Intelligence Extraction in Package Ecosystems

📅 2024-09-23

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Problem: The surge of malicious packages in public package registries (e.g., npm, PyPI) poses a severe threat to software supply chain security, yet existing Software Composition Analysis (SCA) tools suffer from intelligence latency, incomplete coverage, and low accuracy. Method: This paper proposes an automated threat intelligence extraction platform that combines snowball sampling, exhaustive multi-source search, and large language model (LLM) prompts specialized for the task, with the goal of detecting malicious packages early and identifying them with high precision. Contribution/Results: The authors construct a database of 20,692 malicious npm and PyPI packages drawn from 21 intelligence sources. The approach achieves 98.6% precision and an F1 score of 92.0 in intelligence extraction, detects threats on average 70% earlier than leading databases such as Snyk and OSV, and costs $0.094 per intelligence item. It has also identified and reported over 1,000 malicious packages in downstream package manager mirror registries.

📝 Abstract

The rise of malicious packages in public registries poses a significant threat to software supply chain (SSC) security. Although academia and industry employ methods like software composition analysis (SCA) to address this issue, existing approaches often lack timely and comprehensive intelligence updates. This paper introduces PackageIntel, a novel platform that revolutionizes the collection, processing, and retrieval of malicious package intelligence. By utilizing exhaustive search techniques, snowball sampling from diverse sources, and large language models (LLMs) with specialized prompts, PackageIntel ensures enhanced coverage, timeliness, and accuracy. We have developed a comprehensive database containing 20,692 malicious NPM and PyPI packages sourced from 21 distinct intelligence repositories. Empirical evaluations demonstrate that PackageIntel achieves a precision of 98.6% and an F1 score of 92.0 in intelligence extraction. Additionally, it detects threats on average 70% earlier than leading databases like Snyk and OSV, and operates cost-effectively at $0.094 per intelligence piece. The platform has successfully identified and reported over 1,000 malicious packages in downstream package manager mirror registries. This research provides a robust, efficient, and timely solution for identifying and mitigating threats within the software supply chain ecosystem.
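The abstract reports precision and F1 but not recall. Assuming the F1 score is the standard harmonic mean of precision and recall, and that 92.0 is a percentage, the implied recall can be back-calculated. The sketch below is illustrative arithmetic only: the function name `implied_recall` is hypothetical, and the resulting figure is derived here, not reported by the paper.

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1 as fractions in (0, 1]."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        # No positive recall can produce this F1 at this precision.
        raise ValueError("F1 is too high to be consistent with this precision")
    return f1 * precision / denominator


# Figures taken from the abstract; reading 92.0 as 92.0% is an assumption.
recall = implied_recall(precision=0.986, f1=0.920)
print(f"implied recall: {recall:.3f}")
```

Under those assumptions the reported figures imply a recall of roughly 0.86, so some intelligence items are missed even though nearly all extracted items are correct.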

Problem

Research questions and friction points this paper is trying to address.

Detecting malicious packages in software supply chains using LLMs

Improving timeliness and coverage of threat intelligence updates

Automating intelligence extraction from multiple package ecosystem sources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLMs with specialized prompts for intelligence extraction

Using exhaustive search and snowball sampling from diverse sources

Achieving high precision and early threat detection cost-effectively
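Snowball sampling, named in the second point above, generally means starting from a few known sources and repeatedly following their references to discover new ones. A minimal sketch of that general idea follows. It is not PackageIntel's implementation, and the source names, the `get_references` callback, and the `max_rounds` limit are all hypothetical.

```python
from typing import Callable, Iterable, Set


def snowball_sample(
    seeds: Iterable[str],
    get_references: Callable[[str], Iterable[str]],
    max_rounds: int = 3,
) -> Set[str]:
    """Expand a seed set by following references for up to max_rounds rounds."""
    seen = set(seeds)
    frontier = list(seen)
    for _ in range(max_rounds):
        next_frontier = []
        for source in frontier:
            for ref in get_references(source):
                if ref not in seen:  # only newly discovered sources are expanded
                    seen.add(ref)
                    next_frontier.append(ref)
        if not next_frontier:  # nothing new was found, so stop early
            break
        frontier = next_frontier
    return seen


# Toy link graph standing in for real intelligence sources (hypothetical names).
DEMO_LINKS = {
    "blog-a": ["advisory-b", "report-c"],
    "advisory-b": ["report-c", "feed-d"],
    "report-c": [],
    "feed-d": ["blog-a"],
}

found = snowball_sample(["blog-a"], lambda s: DEMO_LINKS.get(s, []))
print(sorted(found))
```

In a real pipeline, `get_references` would fetch a page or feed and extract outgoing links, and the round limit bounds how far the crawl drifts from the trusted seeds.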
