An Embarrassingly Simple Detector for Model Extraction Attacks in Large Language Model API Traffic

📅 2026-06-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the vulnerability of large language model APIs to model extraction attacks, which are hard to detect because individual attack queries are semantically similar to legitimate user requests. The authors formulate detection as a distribution shift test against a reference window of benign traffic and propose a minimalist, unsupervised detector whose decision threshold is calibrated using only historical benign data. The approach embeds queries into a semantic space, uses Maximum Mean Discrepancy (MMD) as the distance between traffic distributions, and is evaluated in both single-user and mixed multi-user scenarios. Across 14 attack-benign query pairs, the method achieves a benign false positive rate of 0.3%, a 100% detection rate on pure-attacker traffic, a 90.5% average detection rate across attacker fractions, and a balanced accuracy of 95.1%.
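The reported figures are mutually consistent under a common reading of balanced accuracy. This is a sanity check only: it assumes balanced accuracy is the mean of the true positive rate and the true negative rate, and that the TPR entering it is the 90.5% average over attacker fractions. The source does not spell out either assumption.

```latex
\[
\text{Balanced accuracy}
  = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}
  = \frac{\mathrm{TPR} + (1 - \mathrm{FPR})}{2}
  = \frac{0.905 + (1 - 0.003)}{2}
  = \frac{1.902}{2}
  = 0.951
\]
```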
📝 Abstract
Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd-llm-mea-detection.
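The detector described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code, and several choices are assumptions made only for illustration: Gaussian vectors stand in for semantic query embeddings, the kernel is an RBF kernel with a median-heuristic bandwidth, the threshold is the 99th percentile of benign-vs-benign scores, and the names `sq_dists`, `mmd2`, and `calibrate_threshold` are hypothetical.

```python
import numpy as np


def sq_dists(x, y):
    # Pairwise squared Euclidean distances between rows of x and rows of y.
    d = (x ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2.0 * x @ y.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off


def mmd2(x, y, bandwidth):
    # Biased (V-statistic) estimate of squared MMD with an RBF kernel.
    def k(a, b):
        return np.exp(-sq_dists(a, b) / (2.0 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())


def calibrate_threshold(reference, benign_pool, window, trials, quantile, bandwidth, rng):
    # Score random benign windows against the benign reference, then take
    # a high quantile of those benign-vs-benign scores as the threshold.
    scores = []
    for _ in range(trials):
        idx = rng.choice(len(benign_pool), size=window, replace=False)
        scores.append(mmd2(reference, benign_pool[idx], bandwidth))
    return float(np.quantile(scores, quantile))


rng = np.random.default_rng(0)
dim, window = 16, 50

# Synthetic stand-ins for query embeddings (assumption, not real traffic).
reference = rng.normal(0.0, 1.0, size=(300, dim))
benign_pool = rng.normal(0.0, 1.0, size=(300, dim))
attack = rng.normal(2.0, 1.0, size=(window, dim))  # mean-shifted "attacker"
fresh_benign = rng.normal(0.0, 1.0, size=(window, dim))

# Median heuristic for the kernel bandwidth (a common default, assumed here).
ref_d = np.sqrt(sq_dists(reference, reference))
bandwidth = float(np.median(ref_d[np.triu_indices_from(ref_d, k=1)]))

threshold = calibrate_threshold(reference, benign_pool, window, 200, 0.99, bandwidth, rng)

# A pure-attacker window and a 50/50 mixed multi-user window.
mixed = np.vstack([fresh_benign[: window // 2], attack[: window // 2]])
attack_score = mmd2(reference, attack, bandwidth)
mixed_score = mmd2(reference, mixed, bandwidth)

print(f"threshold={threshold:.4f}")
print(f"pure attacker: score={attack_score:.4f} flagged={attack_score > threshold}")
print(f"50% mixed:     score={mixed_score:.4f} flagged={mixed_score > threshold}")
```

In this sketch the threshold never sees attack data, which mirrors the benign-calibrated setup: the chosen quantile fixes the expected benign false positive rate, and a window is flagged whenever its score against the reference exceeds that threshold. In practice the synthetic vectors would be replaced by embeddings of real API queries.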
Problem

Research questions and friction points this paper is trying to address.

model extraction attacks
large language models
API traffic
anomaly detection
distribution testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

model extraction detection
distribution testing
maximum mean discrepancy
LLM API security
benign-calibrated monitoring