Can Subgraph Explanations Be Weaponized to Steal Graph Neural Networks?

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work shows that the explainability interfaces of graph neural networks can be exploited for model extraction under a strict black-box setting, where the attacker sees only discrete class labels and binary explanation masks. It proposes the first model extraction attack designed specifically for graph classification in this setting, using subgraph explanations as the attack signal. The method uses explanation outputs to guide Monte Carlo estimation of edge sensitivity toward decision boundaries, with Hoeffding's inequality bounding the estimation error, and uses explanation subgraphs to narrow the boundary search space. Without access to probability outputs, gradients, or confidence scores, it outperforms comparable baselines in experiments on benchmark graph datasets from multiple domains. The results highlight the security risk introduced by explainability mechanisms and inform both defense design and policy for explainable-AI mandates.
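To make the role of Hoeffding's inequality concrete, here is the standard bound for a Monte Carlo mean of bounded samples. The notation is an assumption for illustration and is not taken from the paper: $s_e$ is the true sensitivity of edge $e$, $\hat{s}_e$ is its estimate from $n$ independent samples $X_i \in [0,1]$, $\epsilon$ is the error tolerance, and $\delta$ is the allowed failure probability.

```latex
\hat{s}_e = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\Pr\left(\lvert \hat{s}_e - s_e \rvert \ge \epsilon\right) \le 2\exp\left(-2 n \epsilon^{2}\right)

% Requiring the right-hand side to be at most \delta gives the sample budget
n \ge \frac{\ln(2/\delta)}{2\epsilon^{2}}

% Worked example: \epsilon = 0.1, \delta = 0.05
n \ge \frac{\ln 40}{0.02} \approx 184.4 \quad\Rightarrow\quad n = 185
```

The budget is per estimated quantity, so an attacker estimating many edges would pay this query cost for each one unless samples are shared.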

📝 Abstract

Graph Machine Learning as a Service (GMLaaS) platforms increasingly implement explainability interfaces to meet regulatory transparency requirements. However, this transparency creates exploitable vulnerabilities for model extraction attacks. We present the first model extraction attack specifically designed for graph classification under strict black-box constraints, where the attacker observes only discrete class labels and binary explanation masks (no probability scores, gradients, or confidence values). Our method (1) uses model explanation outputs to guide Monte Carlo edge sensitivity estimation toward decision boundaries, with Hoeffding concentration guarantees on estimation accuracy, and (2) exploits explanation subgraphs to efficiently narrow the boundary search space. Extensive experiments on benchmark graph datasets across multiple domains demonstrate our method's superiority over comparable baselines. These findings show that such explainability interfaces create exploitable attack surfaces, informing both defensive mechanisms and policy frameworks for explainable AI mandates. The implementation code is available at https://github.com/LabRAI/XSTEAL/.
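The idea of estimating edge sensitivity from hard labels alone can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes one plausible definition of sensitivity, namely the probability that the predicted label changes when an edge is removed, averaged over random edge-dropout contexts. The names `edge_sensitivity`, `hoeffding_samples`, `toy_oracle`, and `drop_prob` are all hypothetical, and the oracle is a toy stand-in for a remote label-only API.

```python
import math
import random
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

Edge = Tuple[int, int]
# Hard-label black box (assumed interface): edge set in, class id out.
Oracle = Callable[[FrozenSet[Edge]], int]


def hoeffding_samples(eps: float, delta: float) -> int:
    """Samples so a [0,1]-valued mean is within eps with probability >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def edge_sensitivity(
    oracle: Oracle,
    edges: Iterable[Edge],
    target: Edge,
    n_samples: int,
    drop_prob: float = 0.2,
    rng: Optional[random.Random] = None,
) -> float:
    """Monte Carlo estimate of P(label changes when `target` is removed)."""
    rng = rng or random.Random(0)
    others = [e for e in edges if e != target]
    flips = 0
    for _ in range(n_samples):
        # Random context: each other edge survives with probability 1 - drop_prob.
        context = frozenset(e for e in others if rng.random() >= drop_prob)
        # Two label-only queries per sample: with and without the target edge.
        flips += int(oracle(context | {target}) != oracle(context))
    return flips / n_samples


# Toy oracle (illustrative only): class 1 iff the triangle 0-1-2 is intact.
TRIANGLE = {(0, 1), (1, 2), (0, 2)}


def toy_oracle(edge_set: FrozenSet[Edge]) -> int:
    return int(TRIANGLE <= edge_set)


graph = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
n = hoeffding_samples(eps=0.05, delta=0.01)
s_triangle = edge_sensitivity(toy_oracle, graph, (0, 1), n)
s_tail = edge_sensitivity(toy_oracle, graph, (3, 4), n)
print(n, round(s_triangle, 3), s_tail)
```

In this toy setup, removing `(0, 1)` flips the label whenever the other two triangle edges survive dropout (probability 0.8 × 0.8 = 0.64), while `(3, 4)` never affects the label, so the estimates separate decision-relevant edges from irrelevant ones. Each sample costs two queries, so the query budget grows linearly with the number of edges probed; restricting probes to edges inside a returned explanation mask is one way such a budget could be reduced.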

Problem

Research questions and friction points this paper is trying to address.

model extraction

graph neural networks

explainability

black-box attack

subgraph explanations

Innovation

Methods, ideas, or system contributions that make the work stand out.

model extraction attack

graph neural networks

explainability