Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

📅 2026-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio watermarking methods keep watermark energy low to preserve fidelity, which leaves the watermark vulnerable to suppression by speech reconstruction models. The authors propose a feature-aligned speech watermarking approach: a pretrained speech codec generates a speech-like (pseudo-speech) watermark, which is fused into the spectrogram of the original speech. A voice activity detection (VAD) loss and perceptual losses guide the embedding toward voiced regions. Because the watermark is aligned with the feature distribution of the original speech, it can carry more energy without becoming perceptible. In the reported experiments, imperceptibility is comparable to existing approaches while robustness improves substantially under both seen and unseen speech reconstruction models.
📝 Abstract
Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.
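The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: random noise stands in for the codec-generated pseudo-speech watermark, a simple energy threshold stands in for the VAD, and a fixed additive gain stands in for the learned fusion. All function names and parameters (`stft`, `energy_vad`, `fuse`, `snr_db`, `gain`, `threshold_db`) are hypothetical. The sketch only shows two things: that the watermark is confined to voiced frames, and that raising watermark energy lowers the signal-to-noise ratio, which is the robustness-fidelity trade-off the paper targets.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def energy_vad(spec, threshold_db=-30.0):
    """Toy VAD (assumption): a frame is voiced if its energy is within
    threshold_db of the loudest frame."""
    frame_energy = np.sum(np.abs(spec) ** 2, axis=1)
    energy_db = 10.0 * np.log10(frame_energy / (frame_energy.max() + 1e-12) + 1e-12)
    return energy_db > threshold_db

def fuse(speech_spec, watermark_spec, voiced, gain=0.1):
    """Add the scaled watermark spectrogram only in voiced frames
    (fixed additive fusion is an assumption, not the paper's learned fusion)."""
    mask = voiced[:, None].astype(float)
    return speech_spec + gain * mask * watermark_spec

def snr_db(clean_spec, marked_spec):
    """Signal-to-noise ratio of the watermarked spectrogram, in dB."""
    noise = np.sum(np.abs(marked_spec - clean_spec) ** 2)
    signal = np.sum(np.abs(clean_spec) ** 2)
    return 10.0 * np.log10(signal / (noise + 1e-12))

# Synthetic input: half a second of silence followed by a 220 Hz tone.
sr = 16000
t = np.arange(sr) / sr
speech = np.where(t >= 0.5, np.sin(2 * np.pi * 220 * t), 0.0)

# Placeholder watermark: noise standing in for a codec-generated signal.
rng = np.random.default_rng(0)
watermark = 0.1 * rng.standard_normal(sr)

speech_spec = stft(speech)
watermark_spec = stft(watermark)
voiced = energy_vad(speech_spec)

marked_low = fuse(speech_spec, watermark_spec, voiced, gain=0.1)
marked_high = fuse(speech_spec, watermark_spec, voiced, gain=0.5)

snr_low = snr_db(speech_spec, marked_low)
snr_high = snr_db(speech_spec, marked_high)
print(f"voiced frames: {int(voiced.sum())} of {len(voiced)}")
print(f"SNR at gain 0.1: {snr_low:.2f} dB, at gain 0.5: {snr_high:.2f} dB")
```

In this toy setting the silent frames are left untouched, and multiplying the gain by five costs 20·log10(5) dB of SNR. The paper's premise is that a watermark matched to the speech feature distribution stays imperceptible at energies where an unmatched one, like the noise used here, would not.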
Problem

Research questions and friction points this paper is trying to address.

audio watermarking
robustness
speech reconstruction
fidelity trade-off
imperceptibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

feature-aligned watermarking
speech reconstruction robustness
imperceptible audio watermarking
pretrained speech codec
VAD-guided embedding
Haiyun Li
Shenzhen International Graduate School, Tsinghua University, China; Pengcheng Laboratory, China
Shuhai Peng
Shenzhen International Graduate School, Tsinghua University, China
Zhisheng Zhang
Shenzhen International Graduate School, Tsinghua University, China
Jingran Xie
Shenzhen International Graduate School, Tsinghua University, China
Xiaofeng Xie
Independent Researcher, China
Hanyang Peng
Peng Cheng Laboratory
Deep Learning, Optimization
Zhiyong Wu
Tsinghua University
Software Security