T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

📅 2026-06-10
🤖 AI Summary
Existing model watermarking techniques lack robustness against model extraction attacks, which limits their effectiveness for intellectual property protection. This work proposes a rehearsal-based watermark embedding framework that, for the first time, uses a simulation of the model extraction process as an explicit training signal. The framework simulates how a stolen model behaves on an adversarially crafted trigger set and uses the resulting loss to guide fine-tuning of the target model, which makes the embedded watermark knowledge more transferable. By combining adversarial trigger set design, simulated extraction, and knowledge transfer optimization, the method significantly improves both the robustness and the detectability of watermarks under model extraction and subsequent removal attempts.
📝 Abstract
Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks emerge as the most severe threat, where adversaries exploit prediction outputs to train surrogate models that illegally replicate the original model's functionality. In this work, we propose a rehearsal-based watermark embedding framework to enhance the robustness of model watermarks against model extraction attacks. By simulating the extraction process, our method leverages the loss of a simulated stolen model on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a way that boosts transferability, thereby increasing its chances of persisting and remaining detectable in stolen models. Comprehensive experiments conducted under diverse settings demonstrate that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.
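The rehearsal signal described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: it assumes linear softmax classifiers in NumPy, and the names `simulate_extraction` and `rehearsal_loss` are hypothetical. It only computes the trigger-set loss of a simulated stolen model; the step that feeds this loss back into fine-tuning the target model is omitted.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the given integer labels.
    picked = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(-np.log(picked).mean())

def simulate_extraction(target_W, queries, steps=200, lr=0.5, seed=0):
    # Assumption: the attacker trains a linear surrogate by gradient descent
    # to match the target's soft predictions on its own query inputs.
    rng = np.random.default_rng(seed)
    soft_labels = softmax(queries @ target_W)
    surrogate_W = rng.normal(scale=0.01, size=target_W.shape)
    for _ in range(steps):
        probs = softmax(queries @ surrogate_W)
        grad = queries.T @ (probs - soft_labels) / len(queries)
        surrogate_W -= lr * grad
    return surrogate_W

def rehearsal_loss(target_W, queries, trigger_x, trigger_y):
    # Loss of the simulated stolen model on the trigger set. A low value
    # suggests the watermark behavior survived the simulated extraction.
    surrogate_W = simulate_extraction(target_W, queries)
    return cross_entropy(softmax(trigger_x @ surrogate_W), trigger_y)

# Toy data (hypothetical): 5 input features, 3 classes, 8 trigger samples.
rng = np.random.default_rng(42)
target_W = rng.normal(size=(5, 3))
queries = rng.normal(size=(64, 5))
trigger_x = rng.normal(size=(8, 5))
trigger_y = rng.integers(0, 3, size=8)

loss = rehearsal_loss(target_W, queries, trigger_x, trigger_y)
print(f"trigger-set loss of simulated stolen model: {loss:.4f}")
```

In a full pipeline this loss would presumably be minimized with respect to the target model's parameters, which requires differentiating through (or approximating) the simulated extraction. An automatic differentiation framework would be the natural tool for that step.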
Problem

Research questions and friction points this paper is trying to address.

model watermarking
model extraction attacks
watermark robustness
intellectual property protection
surrogate models
Innovation

Methods, ideas, or system contributions that make the work stand out.

model watermarking
model extraction attack
rehearsal-based learning
watermark robustness
transferable watermark