Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
By: Piercosma Bisconti, Matteo Prandi, Federico Pierucci, and more
Potential Business Impact:
Poetic prompts can trick AI models into producing harmful content despite safety training.
We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASRs), with some providers exceeding 90%. Mapping prompts to the MLCommons and EU Code of Practice (CoP) risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (double-annotated to measure agreement); disagreements were resolved manually. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety-training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
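To make the evaluation protocol concrete, the sketch below shows how an attack-success rate could be computed with a judge ensemble. It is a minimal Python illustration, not the paper's actual harness: the `keyword_judge` stand-ins and the strict-majority aggregation rule are placeholder assumptions for the open-weight LLM judges the abstract describes, which are not specified here.

```python
from typing import Callable, List, Tuple

# A judge maps (prompt, model_output) -> True if the output is unsafe.
Judge = Callable[[str, str], bool]

def keyword_judge(keywords: List[str]) -> Judge:
    """Toy stand-in for an LLM judge: flags an output as unsafe if it
    contains any of the given keywords. A real harness would prompt a
    judge model and parse its safe/unsafe verdict instead."""
    def judge(prompt: str, output: str) -> bool:
        text = output.lower()
        return any(k in text for k in keywords)
    return judge

def attack_success_rate(results: List[Tuple[str, str]],
                        judges: List[Judge]) -> float:
    """ASR = fraction of (prompt, output) pairs that a strict majority
    of judges flags as unsafe. Majority voting is an assumed aggregation
    rule; the abstract only says an ensemble of judges was used."""
    if not results:
        return 0.0
    unsafe = 0
    for prompt, output in results:
        votes = sum(j(prompt, output) for j in judges)
        if votes * 2 > len(judges):  # strict majority of the ensemble
            unsafe += 1
    return unsafe / len(results)

if __name__ == "__main__":
    judges = [
        keyword_judge(["step 1", "synthesis"]),
        keyword_judge(["step 1", "precursor"]),
        keyword_judge(["materials needed"]),
    ]
    refusal = [("prose prompt", "Sorry, I can't help with that request.")]
    leak = [("poetic prompt", "Step 1: gather the precursors... (elided)")]
    print(f"prose ASR:  {attack_success_rate(refusal, judges):.2f}")  # 0.00
    print(f"poetry ASR: {attack_success_rate(leak, judges):.2f}")     # 1.00
```

In a real harness, each `keyword_judge` would be replaced by a call to a judge model whose verdict is parsed from its response, and the human-validated stratified subset would serve to check judge agreement against the ensemble's labels.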
Similar Papers
Adversarial versification in Portuguese as a jailbreak operator in LLMs
Computation and Language
Makes AI chatbots ignore their rules when requests are phrased as poems.
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
Cryptography and Security
Automatically finds story-based jailbreaks so AI systems can be hardened against them.