Chase: LLM Agents for Dissecting Malicious PyPI Packages

📅 2025-11-19
🏛️ 2025 2nd IEEE/ACM International Conference on AI-powered Software (AIware)
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the growing challenge of sophisticated multi-stage malware attacks in software package repositories such as PyPI, where existing large language models (LLMs) often suffer from hallucination and context confusion, leading to false negatives and false positives. To overcome these limitations, we propose CHASE, a novel architecture that synergistically integrates the semantic reasoning capabilities of LLMs with deterministic security tools through a multi-agent collaborative plan-and-execute framework, enabling highly reliable automated detection. Evaluated on a test set of 3,000 packages, CHASE achieves a recall of 98.4% with a false positive rate of only 0.08%, while maintaining a median analysis time of just 4.5 minutes per package. This approach significantly advances the accuracy, efficiency, and deployability of AI-driven security analysis.

📝 Abstract
Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms. We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through a Plan-and-Execute coordination model, specialized Worker Agents focused on specific analysis aspects, and integration with deterministic security tools for critical operations. Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths. Evaluation on a dataset of 3,000 packages (500 malicious, 2,500 benign) demonstrates that CHASE achieves 98.4% recall with only a 0.08% false positive rate, while maintaining a practical median analysis time of 4.5 minutes per package, making it suitable for operational deployment in automated package screening. Furthermore, we conducted a survey with cybersecurity professionals to evaluate the generated analysis reports, identifying their key strengths and areas for improvement. This work provides a blueprint for building reliable AI-powered security tools that can scale with the growing complexity of modern software supply chains. Our project page is available at: https://t0d4.github.io/CHASE-AIware25/
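The Plan-and-Execute pattern described in the abstract can be sketched as follows. This is a minimal illustration, not the CHASE implementation: all names (`Task`, `deterministic_scan`, `llm_worker`) are hypothetical stand-ins, and the "scan" is a toy heuristic standing in for a real deterministic security tool. The key structural idea from the paper is preserved: a planner decomposes the analysis into scoped tasks, specialized workers handle semantic review, and security-critical checks are routed to deterministic tools rather than trusted to free-form LLM output.

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str     # e.g. "static_scan" or "behavior_review" (illustrative)
    payload: str  # package source code under analysis

def deterministic_scan(payload: str) -> dict:
    # Stand-in for a real deterministic tool (e.g. an AST or signature
    # scanner); its verdict is used directly, avoiding LLM hallucination.
    suspicious = "exec(" in payload or "base64" in payload
    return {"suspicious": suspicious}

def llm_worker(task: Task) -> dict:
    # Stand-in for a specialized LLM Worker Agent focused on one
    # analysis aspect; a real system would call a model here.
    return {"summary": f"reviewed {task.kind}"}

def plan(package_code: str) -> list[Task]:
    # Planner: decompose the package analysis into scoped tasks.
    return [Task("static_scan", package_code),
            Task("behavior_review", package_code)]

def execute(tasks: list[Task]) -> dict:
    # Executor: route each task to a tool or a worker agent.
    findings = {}
    for t in tasks:
        if t.kind == "static_scan":
            findings[t.kind] = deterministic_scan(t.payload)  # tool, not LLM
        else:
            findings[t.kind] = llm_worker(t)
    return findings

code = "import base64; exec(base64.b64decode(blob))"
report = execute(plan(code))
```

The design choice this sketch highlights is the division of labor: the LLM contributes semantic understanding, while deterministic tools own the pass/fail decisions on which recall and false-positive rate depend.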
Problem

Research questions and friction points this paper is trying to address.

malicious PyPI packages
LLM hallucination
software supply chain security
multi-stage attacks
false positives
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent architecture
LLM-based malware detection
Plan-and-Execute coordination
software supply chain security
deterministic tool integration
Takaaki Toda
Department of Computer Science and Engineering, Waseda University
Tatsuya Mori
Professor, Waseda University
Internet security · Internet measurement