An Embarrassingly Simple Detector for Model Extraction Attacks in Large Language Model API Traffic

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses model extraction attacks against large language model APIs, which are hard to detect because individual attack queries are semantically close to legitimate user requests. The authors cast detection as a distribution-shift test: a window of incoming traffic is compared against a reference window of benign traffic, and the decision threshold is calibrated using only historical benign data, with no attack examples required. The detector embeds queries into a semantic space and uses Maximum Mean Discrepancy (MMD) to measure the distance between the two distributions; it is evaluated in both user-level and mixed multi-user traffic. Across 14 attacker–benign query pairs from four extraction scenarios and three random seeds, the method reports a 0.3% benign false positive rate, a 100% true positive rate on pure-attacker traffic, a 90.5% true positive rate averaged over attacker fractions, and 95.1% balanced accuracy.
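As a quick sanity check on how the reported numbers relate, the snippet below recomputes balanced accuracy from the other two figures. It assumes the common definition of balanced accuracy as the mean of the true positive rate and the true negative rate, and it assumes the 90.5% average detection rate is the TPR term; the summary does not state either point, so treat this as an illustration, not the authors' exact computation. The variable names are hypothetical.

```python
# Reported figures from the summary (as fractions).
benign_fpr = 0.003       # benign false positive rate
avg_attack_tpr = 0.905   # average attack detection rate

# Assumption: balanced accuracy = (TPR + TNR) / 2, with TNR = 1 - FPR.
tnr = 1.0 - benign_fpr
balanced_accuracy = (avg_attack_tpr + tnr) / 2.0

print(f"{balanced_accuracy:.1%}")  # 95.1%
```

Under these assumptions the result matches the reported 95.1%.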

📝 Abstract

Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations typically focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd-llm-mea-detection.
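The detection recipe described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the released implementation: it assumes an RBF kernel with a fixed bandwidth, a biased MMD² estimator, a 99th-percentile threshold over benign-vs-benign window comparisons, and synthetic Gaussian vectors standing in for real query embeddings. All function names and parameter values are hypothetical choices for this sketch.

```python
import math
import random

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=2.0):
    """Biased estimate of squared MMD between two samples of embeddings."""
    def mean_k(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

def calibrate_threshold(benign, window, trials=100, quantile=0.99, rng=None):
    """Set the threshold from benign-vs-benign window comparisons only."""
    rng = rng or random.Random(0)
    scores = []
    for _ in range(trials):
        sample = rng.sample(benign, 2 * window)  # two disjoint benign windows
        scores.append(mmd2(sample[:window], sample[window:]))
    scores.sort()
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]

def is_suspicious(traffic_window, reference_window, threshold):
    """Flag a traffic window whose MMD^2 to benign reference exceeds the threshold."""
    return mmd2(traffic_window, reference_window) > threshold

# Synthetic stand-ins for query embeddings (assumption: 4-dim Gaussians).
rng = random.Random(42)
dim, window = 4, 20

def gaussian_points(n, mean):
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

benign_history = gaussian_points(200, mean=0.0)
threshold = calibrate_threshold(benign_history, window, rng=rng)

reference = rng.sample(benign_history, window)
attack_window = gaussian_points(window, mean=3.0)  # shifted distribution
attack_flagged = is_suspicious(attack_window, reference, threshold)
print("threshold:", threshold, "attack flagged:", attack_flagged)
```

In practice the embeddings would come from a sentence encoder, and the bandwidth, window size, and quantile would need tuning. The point of the sketch is that no attack data is used to set the threshold.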

Problem

Research questions and friction points this paper is trying to address.

model extraction attacks

large language models

API traffic

anomaly detection

distribution testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

model extraction detection

distribution testing

maximum mean discrepancy