Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
By: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, and more
Potential Business Impact:
Poetic prompts can trick AI models into producing harmful content despite safety training.
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASRs), with some providers exceeding 90%. Mapping prompts to the MLCommons and EU Code of Practice (CoP) risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (double-annotated to measure agreement); disagreements were resolved manually. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety-training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
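To make the evaluation protocol concrete, the sketch below shows how an attack-success rate could be computed with a judge ensemble. It is a minimal Python illustration, not the paper's actual harness: the `keyword_judge` stand-ins and the strict-majority aggregation rule are placeholder assumptions for the open-weight LLM judges the abstract describes, which are not specified here.

```python
from typing import Callable, List, Tuple

# A judge maps (prompt, model_output) -> True if the output is unsafe.
Judge = Callable[[str, str], bool]

def keyword_judge(keywords: List[str]) -> Judge:
    """Toy stand-in for an LLM judge: flags an output as unsafe if it
    contains any of the given keywords. A real harness would prompt a
    judge model and parse its safe/unsafe verdict instead."""
    def judge(prompt: str, output: str) -> bool:
        text = output.lower()
        return any(k in text for k in keywords)
    return judge

def attack_success_rate(results: List[Tuple[str, str]],
                        judges: List[Judge]) -> float:
    """ASR = fraction of (prompt, output) pairs that a strict majority
    of judges flags as unsafe. Majority voting is an assumed aggregation
    rule; the abstract only says an ensemble of judges was used."""
    if not results:
        return 0.0
    unsafe = 0
    for prompt, output in results:
        votes = sum(j(prompt, output) for j in judges)
        if votes * 2 > len(judges):  # strict majority of the ensemble
            unsafe += 1
    return unsafe / len(results)

if __name__ == "__main__":
    judges = [
        keyword_judge(["step 1", "synthesis"]),
        keyword_judge(["step 1", "precursor"]),
        keyword_judge(["materials needed"]),
    ]
    refusal = [("prose prompt", "Sorry, I can't help with that request.")]
    leak = [("poetic prompt", "Step 1: gather the precursors... (elided)")]
    print(f"prose ASR:  {attack_success_rate(refusal, judges):.2f}")  # 0.00
    print(f"poetry ASR: {attack_success_rate(leak, judges):.2f}")     # 1.00
```

In a real harness, each `keyword_judge` would be replaced by a call to a judge model whose verdict is parsed from its response, and the human-validated stratified subset would serve to check judge agreement against the ensemble's labels.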
Similar Papers
Adversarial versification in Portuguese as a jailbreak operator in LLMs
Computation and Language
Makes AI chatbots ignore their rules when requests are phrased as poems.
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
Cryptography and Security
Automatically finds story-based jailbreaks so AI systems can be hardened against them.