Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation
By: Xin Yi, Yue Li, Shunfan Zheng, and more
Potential Business Impact:
Shows how AI watermarks that catch model copying can be removed or faked.
Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts by comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore the critical need for developing watermarking schemes that are both robust and unforgeable.
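The contrastive-decoding step described in the abstract can be sketched at the token level. The snippet below is a minimal illustration, not the authors' released implementation: the scoring rule (contrasting student and weak-reference log-probabilities with a strength `alpha`), the greedy token selection, and the toy logits are all assumptions. The idea it captures is the bidirectional one from the paper: pushing toward the part of the student's distribution not shared with the weakly watermarked reference amplifies the watermark signal (spoofing direction), while pushing away from it suppresses the signal (scrubbing direction); in CDG-KD the resulting texts would then feed bidirectional distillation.

```python
# Hypothetical sketch of a bidirectional contrastive-decoding step.
# `alpha` and the exact scoring rule are illustrative assumptions.
import torch


def contrastive_next_token(student_logits: torch.Tensor,
                           reference_logits: torch.Tensor,
                           alpha: float = 1.0,
                           amplify: bool = True) -> int:
    """Greedily pick the next token by contrasting two logit streams.

    student_logits:   next-token logits from the distilled student model
                      (strong inherited watermark).
    reference_logits: next-token logits from a weakly watermarked reference.
    amplify=True exaggerates what the student has and the reference lacks
    (spoofing direction); amplify=False suppresses it (scrubbing direction).
    """
    log_p_student = torch.log_softmax(student_logits, dim=-1)
    log_p_reference = torch.log_softmax(reference_logits, dim=-1)
    # The contrast term isolates the student-specific (watermark-laden) signal.
    contrast = log_p_student - log_p_reference
    sign = 1.0 if amplify else -1.0
    scores = log_p_student + sign * alpha * contrast
    return int(torch.argmax(scores))


# Toy usage over a 10-token vocabulary with random logits.
torch.manual_seed(0)
student, reference = torch.randn(10), torch.randn(10)
print("spoofing pick:", contrastive_next_token(student, reference, amplify=True))
print("scrubbing pick:", contrastive_next_token(student, reference, amplify=False))
```

With `alpha = 1.0` and `amplify=False`, the score degenerates to the reference's log-probabilities, so `alpha` effectively interpolates between following the weak reference and actively steering away from the student-specific watermark signal.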
Similar Papers
Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
Sound
Stops fake voices from being used wrongly.
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation
Cryptography and Security
Makes AI text carry fake watermarks to trick people.
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
Cryptography and Security
Protects AI writing from being faked or changed.