AI Summary
This work addresses the vulnerability of large language model APIs to model extraction attacks, which are hard to detect because individual attack queries are semantically close to legitimate user requests. The authors formulate detection as a distribution shift test against a reference window of benign traffic and propose a minimalist, unsupervised detector whose decision threshold is calibrated using only historical benign data. The approach embeds queries into a semantic space and uses Maximum Mean Discrepancy (MMD) to measure the distance between the incoming and reference distributions; it is effective in both single-user and mixed multi-user scenarios. Across 14 attack–benign query pairs, the method achieves a benign false positive rate of 0.3%, a 100% detection rate on pure-attacker traffic, an average detection rate of 90.5% across attacker fractions, and a balanced accuracy of 95.1%.
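As a quick consistency check on the reported numbers, the balanced accuracy can be reproduced from the other two rates. This assumes the standard definition of balanced accuracy (the mean of the true positive and true negative rates) and that the true positive rate used is the 90.5% average; the source does not state either explicitly.

```latex
\mathrm{BA}
  = \frac{\mathrm{TPR} + \mathrm{TNR}}{2}
  = \frac{\mathrm{TPR} + (1 - \mathrm{FPR})}{2}
  = \frac{0.905 + (1 - 0.003)}{2}
  = \frac{1.902}{2}
  = 0.951
```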
Abstract
Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd-llm-mea-detection.
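The detection recipe described above (embed queries, compare a traffic window against benign reference traffic with MMD, and set the threshold from benign-vs-benign comparisons only) can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the RBF kernel, the bandwidth `GAMMA`, the 0.99 calibration quantile, the window size, and the synthetic Gaussian vectors standing in for real query embeddings are all assumptions made for the example.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma):
    """Biased estimate of squared MMD between samples x and y."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())


def calibrate_threshold(benign, window, n_trials, quantile, gamma, rng):
    """Set the threshold from benign-vs-benign window comparisons only."""
    scores = []
    for _ in range(n_trials):
        idx = rng.permutation(len(benign))
        ref = benign[idx[:window]]
        test = benign[idx[window:2 * window]]
        scores.append(mmd2(ref, test, gamma))
    return float(np.quantile(scores, quantile))


def is_suspicious(reference, window_emb, threshold, gamma):
    """Flag a traffic window whose MMD to the reference exceeds the threshold."""
    return bool(mmd2(reference, window_emb, gamma) > threshold)


# Synthetic stand-ins for query embeddings (assumption: real embeddings
# would come from a sentence encoder, which is not specified here).
DIM, WINDOW = 16, 100
GAMMA = 1.0 / DIM  # assumed bandwidth; a median heuristic is a common alternative
rng = np.random.default_rng(0)

benign_history = rng.normal(0.0, 1.0, size=(400, DIM))
reference = benign_history[:WINDOW]
attack_window = rng.normal(1.0, 1.0, size=(WINDOW, DIM))  # shifted distribution

threshold = calibrate_threshold(
    benign_history, WINDOW, n_trials=200, quantile=0.99, gamma=GAMMA, rng=rng
)
attack_flagged = is_suspicious(reference, attack_window, threshold, GAMMA)
```

Because calibration uses only benign traffic, no attack examples are needed to deploy such a detector; the chosen quantile trades benign false positives against sensitivity to small attacker fractions within a mixed window.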