MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking
By: Yizhou Zhao, Zhiwei Steven Wu, Adam Block
Potential Business Impact:
Makes AI writing traceable, even after edits.
Watermarking aims to embed hidden signals in generated text that can be reliably detected by anyone holding a secret key. Open-weight language models pose acute challenges for such schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations large enough to noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that it consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
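To make the abstract's objective concrete, below is a minimal sketch (not the authors' code) of the kind of reward MarkTune's description suggests: a GaussMark-style detection statistic used as the reward signal, minus a penalty for drifting from the original model's token probabilities. All names and the specific formulas (`detection_score`, `kl_to_reference`, `beta`) are illustrative assumptions rather than the paper's actual notation or implementation.

```python
# Sketch of an on-policy reward combining a GaussMark-style watermark
# statistic with a quality-preservation regularizer. Hypothetical, for
# illustration only; the paper's exact objective may differ.
import numpy as np

rng = np.random.default_rng(0)

def detection_score(grad_wrt_key_weights: np.ndarray, secret_key: np.ndarray) -> float:
    """Normalized correlation between the secret Gaussian perturbation (the key)
    and the generated text's log-likelihood gradient on the watermarked weight
    block. Higher values mean the watermark is easier to detect."""
    return float(secret_key @ grad_wrt_key_weights) / (
        np.linalg.norm(secret_key) * np.linalg.norm(grad_wrt_key_weights) + 1e-8
    )

def kl_to_reference(logp_current: np.ndarray, logp_reference: np.ndarray) -> float:
    """On-policy KL proxy: average gap between the fine-tuned model's and the
    reference model's per-token log-probabilities (quality regularizer)."""
    return float(np.mean(logp_current - logp_reference))

def marktune_style_reward(grad, key, logp_cur, logp_ref, beta=0.1):
    """Reward = watermark detectability minus a quality-degradation penalty."""
    return detection_score(grad, key) - beta * kl_to_reference(logp_cur, logp_ref)

# Toy usage with random stand-ins for one sampled generation.
d, T = 4096, 32                          # watermarked-block size, sequence length
key = rng.normal(size=d)                 # secret Gaussian key
grad = 0.3 * key + rng.normal(size=d)    # gradient partially aligned with the key
logp_cur = rng.normal(-2.0, 0.1, T)      # fine-tuned model's token log-probs
logp_ref = logp_cur - rng.normal(0.05, 0.02, T)  # reference model's token log-probs
print(marktune_style_reward(grad, key, logp_cur, logp_ref))
```

In this reading, raising the reward pushes weight updates in directions that strengthen the key-aligned signal, while the `beta`-weighted penalty keeps the fine-tuned policy close to the original model, which is the quality-detectability trade-off the abstract refers to.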
Similar Papers
Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models
Machine Learning (CS)
Makes AI writing show it's from AI.
Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking
Computation and Language
Protects writing from being copied by AI.
Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization
Cryptography and Security
Makes AI writing sound natural, not fake.