Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation
By: Xin Yi, Yue Li, Shunfan Zheng, and more
Potential Business Impact:
Shows how AI watermarks that catch model copying can be removed or faked.
Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are both robust and unforgeable.
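The contrastive-decoding step described in the abstract can be sketched at the token level. The snippet below is a minimal illustration, not the authors' released implementation: the scoring rule (contrasting student and weak-reference log-probabilities with a strength `alpha`), the greedy token selection, and the toy logits are all assumptions. The idea it captures is the bidirectional one from the paper: pushing toward the part of the student's distribution not shared with the weakly watermarked reference amplifies the watermark signal (spoofing direction), while pushing away from it suppresses the signal (scrubbing direction); in CDG-KD the resulting texts would then feed bidirectional distillation.

```python
# Hypothetical sketch of a bidirectional contrastive-decoding step.
# `alpha` and the exact scoring rule are illustrative assumptions.
import torch


def contrastive_next_token(student_logits: torch.Tensor,
                           reference_logits: torch.Tensor,
                           alpha: float = 1.0,
                           amplify: bool = True) -> int:
    """Greedily pick the next token by contrasting two logit streams.

    student_logits:   next-token logits from the distilled student model
                      (strong inherited watermark).
    reference_logits: next-token logits from a weakly watermarked reference.
    amplify=True exaggerates what the student has and the reference lacks
    (spoofing direction); amplify=False suppresses it (scrubbing direction).
    """
    log_p_student = torch.log_softmax(student_logits, dim=-1)
    log_p_reference = torch.log_softmax(reference_logits, dim=-1)
    # The contrast term isolates the student-specific (watermark-laden) signal.
    contrast = log_p_student - log_p_reference
    sign = 1.0 if amplify else -1.0
    scores = log_p_student + sign * alpha * contrast
    return int(torch.argmax(scores))


# Toy usage over a 10-token vocabulary with random logits.
torch.manual_seed(0)
student, reference = torch.randn(10), torch.randn(10)
print("spoofing pick:", contrastive_next_token(student, reference, amplify=True))
print("scrubbing pick:", contrastive_next_token(student, reference, amplify=False))
```

With `alpha = 1.0` and `amplify=False`, the score degenerates to the reference's log-probabilities, so `alpha` effectively interpolates between following the weak reference and actively steering away from the student-specific watermark signal.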
Similar Papers
Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
Sound
Stops fake voices from being used wrongly.
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
Cryptography and Security
Makes AI text carry fake watermarks to trick people.
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
Cryptography and Security
Protects AI writing from being faked or changed.