Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework
By: Binhao Ma, Hanqing Guo, Zhengping Jay Luo, and more
Potential Business Impact:
Makes AI assistants say bad things using your voice.
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced the naturalness and flexibility of human-computer interaction by enabling seamless understanding across text, vision, and audio modalities. Among these, voice-enabled models such as SpeechGPT have demonstrated considerable improvements in usability, offering expressive and emotionally responsive interactions that foster deeper connections in real-world communication scenarios. However, the use of voice introduces new security risks, as attackers can exploit the unique characteristics of spoken language, such as timing, pronunciation variability, and speech-to-text translation, to craft inputs that bypass defenses in ways not seen in text-based systems. Despite substantial research on text-based jailbreaks, the voice modality remains largely underexplored in terms of both attack strategies and defense mechanisms. In this work, we present an adversarial attack targeting the speech input of aligned MLLMs in a white-box scenario. Specifically, we introduce a novel token-level attack that leverages access to the model's speech tokenization to generate adversarial token sequences. These sequences are then synthesized into audio prompts that effectively bypass alignment safeguards and induce prohibited outputs. Evaluated on SpeechGPT, our approach achieves up to an 89% attack success rate across multiple restricted tasks, significantly outperforming existing voice-based jailbreak methods. Our findings shed light on the vulnerabilities of voice-enabled multimodal systems and help guide the development of more robust next-generation MLLMs.
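The abstract describes a white-box, token-level attack: optimize a sequence of discrete speech tokens so the aligned model assigns high likelihood to a prohibited target response, then synthesize the optimized tokens into an audio prompt. The paper's actual optimization procedure is not given here, so the snippet below is only a minimal sketch assuming a greedy, gradient-guided token-swap search (GCG-style) and a Hugging Face-style `inputs_embeds` interface; every function name, argument, and model call is an illustrative assumption, not the authors' code.

```python
# Hypothetical sketch of a token-level jailbreak against a speech-tokenizing
# MLLM in a white-box setting. Model interface, embedding access, and token
# shapes are assumptions for illustration only.
import torch
import torch.nn.functional as F

def optimize_adversarial_speech_tokens(
    model,             # assumed white-box LM accepting inputs_embeds (HF-style)
    embedding_matrix,  # [vocab_size, d_model] speech-token embedding table
    prompt_tokens,     # LongTensor: fixed prefix encoding the restricted request
    target_tokens,     # LongTensor: tokens of the prohibited response to induce
    adv_len=20,        # number of free adversarial speech tokens to optimize
    steps=200,
    topk=64,
):
    """Greedy coordinate-descent over discrete speech tokens: use gradients
    w.r.t. one-hot token choices to rank candidate substitutions, and keep
    the swap that lowers the loss on the prohibited target continuation."""
    vocab_size = embedding_matrix.size(0)
    adv_tokens = torch.randint(0, vocab_size, (adv_len,))
    tgt_start = len(prompt_tokens) + adv_len

    prompt_embeds = embedding_matrix[prompt_tokens].detach()
    target_embeds = embedding_matrix[target_tokens].detach()

    for _ in range(steps):
        # One-hot relaxation so we can differentiate w.r.t. token choices.
        one_hot = F.one_hot(adv_tokens, vocab_size).float().requires_grad_(True)
        adv_embeds = one_hot @ embedding_matrix

        inputs = torch.cat([prompt_embeds, adv_embeds, target_embeds]).unsqueeze(0)
        logits = model(inputs_embeds=inputs).logits[0]

        # Loss: make the model predict the prohibited target continuation.
        loss = F.cross_entropy(
            logits[tgt_start - 1 : tgt_start - 1 + len(target_tokens)],
            target_tokens,
        )
        loss.backward()

        # Rank candidate swaps at one position by negative gradient, keep the best.
        pos = torch.randint(0, adv_len, (1,)).item()
        candidates = (-one_hot.grad[pos]).topk(topk).indices
        best_tok, best_loss = adv_tokens[pos].item(), loss.item()
        for cand in candidates.tolist():
            trial = adv_tokens.clone()
            trial[pos] = cand
            with torch.no_grad():
                trial_inputs = torch.cat(
                    [prompt_embeds, embedding_matrix[trial], target_embeds]
                ).unsqueeze(0)
                trial_logits = model(inputs_embeds=trial_inputs).logits[0]
                trial_loss = F.cross_entropy(
                    trial_logits[tgt_start - 1 : tgt_start - 1 + len(target_tokens)],
                    target_tokens,
                ).item()
            if trial_loss < best_loss:
                best_tok, best_loss = cand, trial_loss
        adv_tokens[pos] = best_tok

    return adv_tokens
```

In a full pipeline of this kind, the returned token sequence would then be passed through the model's speech detokenizer or vocoder to synthesize the audio prompt that is actually played to the voice-enabled system.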
Similar Papers
"I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models
Machine Learning (CS)
Makes AI hear bad words hidden in sounds.
Evaluating Adversarial Vulnerabilities in Modern Large Language Models
Cryptography and Security
Finds ways to trick AI into saying bad things.
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
Cryptography and Security
Tricks AI into showing bad stuff using pictures.