A Reinforcement Learning Framework for Robust and Secure LLM Watermarking
By: Li An, Yujian Liu, Yepeng Liu and more
Potential Business Impact:
Makes AI writing harder to fake or remove.
Watermarking has emerged as a promising solution for tracing and authenticating text generated by large language models (LLMs). A common approach to LLM watermarking is to construct a green/red token list and assign higher or lower generation probabilities to the corresponding tokens, respectively. However, most existing watermarking algorithms rely on heuristic green/red token list designs, as directly optimizing the list design with techniques such as reinforcement learning (RL) comes with several challenges. First, desirable watermarking involves multiple criteria, i.e., detectability, text quality, robustness against removal attacks, and security against spoofing attacks. Directly optimizing for these criteria introduces many partially conflicting reward terms, leading to an unstable convergence process. Second, the vast action space of green/red token list choices is susceptible to reward hacking. In this paper, we propose an end-to-end RL framework for robust and secure LLM watermarking. Our approach adopts an anchoring mechanism for reward terms to ensure stable training and introduces additional regularization terms to prevent reward hacking. Experiments on standard benchmarks with two backbone LLMs show that our method achieves a state-of-the-art trade-off across all criteria, with notable improvements in resistance to spoofing attacks without degrading other criteria. Our code is available at https://github.com/UCSB-NLP-Chang/RL-watermark.
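To make the green/red token list idea in the abstract concrete, here is a minimal sketch of the common heuristic approach that the abstract contrasts with, not the paper's RL framework. Everything in it is an illustrative assumption: the function names `green_list`, `bias_logits`, and `detect_z`, the key `b"demo-key"`, the green fraction `gamma`, the logit bias `delta`, and the choice to seed the list from the previous token with a hash.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab_size, gamma=0.5, key=b"demo-key"):
    """Pick a pseudo-random 'green' subset of token ids, seeded by the previous token.

    Hypothetical helper: the keyed hash makes the list reproducible for anyone
    who knows the key, which is what lets a detector recompute it later.
    """
    digest = hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def bias_logits(logits, prev_token, delta=2.0, gamma=0.5):
    """Add delta to green-token logits so green tokens become more likely."""
    green = green_list(prev_token, len(logits), gamma)
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def detect_z(tokens, vocab_size, gamma=0.5):
    """One-proportion z-score for the number of green tokens in a sequence.

    Unwatermarked text should land near 0; watermarked text should score high.
    """
    t = len(tokens) - 1  # the first token has no predecessor to seed a list
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

In this sketch, generation would call `bias_logits` before sampling each token, and a detector would call `detect_z` on the finished text and compare the score to a threshold. The fixed hash-based list is exactly the kind of heuristic design the abstract says most existing algorithms rely on; the proposed framework instead optimizes the list design with RL.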
Similar Papers
Optimizing Token Choice for Code Watermarking: A RL Approach
Cryptography and Security
Finds fake computer code made by AI.
Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization
Cryptography and Security
Makes AI writing sound natural, not fake.
Yet Another Watermark for Large Language Models
Cryptography and Security
Marks computer writing so you know it's real.