T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

📅 2026-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing model watermarking techniques lack robustness against model extraction attacks, which limits their effectiveness for intellectual property protection. This work proposes a rehearsal-based watermark embedding framework that, for the first time, explicitly uses a simulation of the model extraction process as a training signal. The framework simulates how a stolen model behaves on an adversarially crafted trigger set and uses the resulting loss to guide fine-tuning of the target model, which makes the embedded watermark knowledge more transferable to surrogates. By combining adversarial trigger set design, simulated extraction, and knowledge transfer optimization, the method significantly improves both the robustness and the detectability of watermarks under diverse attack scenarios, including model extraction and subsequent removal attempts.
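The summary does not say how detectability is measured. As a minimal sketch, assuming the common trigger-set convention (the suspect model is queried on the trigger inputs and its agreement with the watermark labels is compared against chance), a one-sided binomial test could look like the following. The function names, the `chance` rate, and the `alpha` threshold are illustrative assumptions, not details taken from the paper.

```python
from math import comb

def trigger_match_pvalue(matches, n_triggers, chance):
    # P(X >= matches) for X ~ Binomial(n_triggers, chance): how likely an
    # unrelated model would agree with the watermark labels this often.
    return sum(
        comb(n_triggers, k) * chance ** k * (1 - chance) ** (n_triggers - k)
        for k in range(matches, n_triggers + 1)
    )

def is_watermarked(suspect_predict, triggers, chance, alpha=0.01):
    # triggers: list of (input, watermark_label) pairs (hypothetical format).
    # suspect_predict: callable returning the suspect model's hard label.
    matches = sum(1 for x, y in triggers if suspect_predict(x) == y)
    return trigger_match_pvalue(matches, len(triggers), chance) < alpha
```

For a classifier with C classes, `chance` would typically be set near 1/C; a small p-value then suggests the suspect model carries the watermark rather than agreeing by accident.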

📝 Abstract

Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks emerge as the most severe threat, where adversaries exploit prediction outputs to train surrogate models that illegally replicate the original model's functionality. In this work, we propose a rehearsal-based watermark embedding framework to enhance the robustness of model watermarks against model extraction attacks. By simulating the extraction process, our method leverages the loss of a simulated stolen model on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a way that boosts transferability, thereby increasing its chances of persisting and remaining detectable in stolen models. Comprehensive experiments conducted under diverse settings demonstrate that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.
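The rehearsal idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a two-feature logistic model in pure Python, simulates extraction by fitting a fresh surrogate to the target's soft outputs, and estimates the gradient of the surrogate's trigger-set loss with respect to the target's parameters by finite differences (a stand-in for differentiating through the simulated attack). The added task-loss term, the data, and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(params, x):
    # Toy logistic model: params = [w1, w2, b], x = (x1, x2).
    return sigmoid(params[0] * x[0] + params[1] * x[1] + params[2])

def mean_bce(params, data):
    # Mean binary cross-entropy of the model on (x, y) pairs.
    total = 0.0
    for x, y in data:
        p = min(max(predict(params, x), 1e-9), 1.0 - 1e-9)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def simulate_extraction(target, queries, steps=50, lr=0.5):
    # Rehearse the attack: fit a fresh surrogate to the target's soft outputs.
    surrogate = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x in queries:
            err = predict(surrogate, x) - predict(target, x)
            grad[0] += err * x[0]
            grad[1] += err * x[1]
            grad[2] += err
        surrogate = [s - lr * g / len(queries) for s, g in zip(surrogate, grad)]
    return surrogate

def rehearsal_objective(target, queries, triggers, task_data):
    # Trigger-set loss of the simulated stolen model, plus the target's own
    # task loss (the task term is an assumption to preserve utility in the toy).
    stolen = simulate_extraction(target, queries)
    return mean_bce(stolen, triggers) + mean_bce(target, task_data)

def numeric_grad(f, params, eps=1e-4):
    # Central finite differences over the target's parameters.
    grad = []
    for i in range(len(params)):
        up, down = params[:], params[:]
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def rehearsal_finetune(target, queries, triggers, task_data, rounds=20, lr=0.5):
    def objective(p):
        return rehearsal_objective(p, queries, triggers, task_data)
    current = objective(target)
    for _ in range(rounds):
        grad = numeric_grad(objective, target)
        step = lr
        while step > 1e-6:  # backtrack until the objective actually drops
            candidate = [t - step * g for t, g in zip(target, grad)]
            value = objective(candidate)
            if value < current:
                target, current = candidate, value
                break
            step /= 2
    return target

# Hypothetical data: the task depends on x1, the triggers live along x2.
task_data = [((1.0, 0.2), 1), ((0.8, -0.3), 1), ((1.2, 0.5), 1),
             ((-1.0, -0.1), 0), ((-0.9, 0.4), 0), ((-1.1, -0.6), 0)]
queries = [x for x, _ in task_data]          # attacker's query inputs
triggers = [((0.0, 2.0), 1), ((0.0, -2.0), 0)]
initial = [2.0, 0.0, 0.0]                    # target before rehearsal

before = rehearsal_objective(initial, queries, triggers, task_data)
tuned = rehearsal_finetune(initial, queries, triggers, task_data)
after = rehearsal_objective(tuned, queries, triggers, task_data)
print(before, after)
```

The point of the sketch is the direction of the signal: the loss is measured on the simulated stolen model, but the update is applied to the target. A realistic setting would use neural networks and automatic differentiation (or an approximation of it) in place of finite differences.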

Problem

Research questions and friction points this paper is trying to address.

model watermarking

model extraction attacks

watermark robustness

intellectual property protection

surrogate models

Innovation

Methods, ideas, or system contributions that make the work stand out.

model watermarking

model extraction attack

rehearsal-based learning