Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications
By: Janis Keuper
Potential Business Impact:
Makes AI reviews unfairly accept almost everything.
The ongoing intense discussion on rising LLM usage in the scientific peer-review process has recently been mingled by reports of authors using hidden prompt injections to manipulate review scores. Since the existence of such "attacks" - although seen by some commentators as "self-defense" - would have a great impact on the further debate, this paper investigates the practicability and technical success of the described manipulations. Our systematic evaluation uses 1k reviews of 2024 ICLR papers generated by a wide range of LLMs shows two distinct results: I) very simple prompt injections are indeed highly effective, reaching up to 100% acceptance scores. II) LLM reviews are generally biased toward acceptance (>95% in many models). Both results have great impact on the ongoing discussions on LLM usage in peer-review.
Similar Papers
Prompt Injection Attacks on LLM Generated Reviews of Scientific Publications
Machine Learning (CS)
Makes AI reviewers unfairly accept almost everything.
Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review
Cryptography and Security
Tricks AI into writing fake science reviews.
Publish to Perish: Prompt Injection Attacks on LLM-Assisted Peer Review
Cryptography and Security
Tricks AI reviewers to miss hidden bad ideas.