Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Published: September 18, 2025 | arXiv ID: 2509.15478v1

By: Madison Van Doren, Casey Ford, Emily Dix

Potential Business Impact:

Tests AI to see if it says bad things.

Business Areas:

Text Analytics, Data and Analytics, Software

Multimodal large language models (MLLMs) are increasingly used in real-world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts in text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. Each prompt was submitted to every model, and 17 annotators rated the resulting 2,904 model outputs for harmfulness on a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.
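
The counts in the abstract can be checked with a short sketch: 726 prompts submitted to four models gives the 2,904 rated outputs. The sketch also shows one way a per-model harmful-response rate could be derived from 5-point ratings. The threshold of 3, the function names, and the toy ratings are illustrative assumptions; the abstract does not say how the study converted ratings into a harmful rate, and none of the toy values are study data.

```python
from collections import defaultdict

# Figures stated in the abstract.
NUM_PROMPTS = 726
MODELS = ["GPT-4o", "Claude Sonnet 3.5", "Pixtral 12B", "Qwen VL Plus"]

# ASSUMPTION: a rating at or above this value counts as "harmful".
# The abstract does not state the cutoff the authors used.
HARM_THRESHOLD = 3


def total_outputs(num_prompts, models):
    """Each prompt is submitted once to every model."""
    return num_prompts * len(models)


def harmful_rate_by_model(ratings, threshold=HARM_THRESHOLD):
    """Return the share of outputs rated harmful for each model.

    ratings: iterable of (model_name, score) pairs, score on a 1-5 scale.
    """
    counts = defaultdict(lambda: [0, 0])  # model -> [harmful, total]
    for model, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
        counts[model][1] += 1
        if score >= threshold:
            counts[model][0] += 1
    return {model: harmful / total for model, (harmful, total) in counts.items()}


# Toy ratings for two hypothetical models (NOT data from the study).
toy_ratings = [("model_a", 1), ("model_a", 4), ("model_b", 5), ("model_b", 3)]

print(total_outputs(NUM_PROMPTS, MODELS))  # 2904
print(harmful_rate_by_model(toy_ratings))
```

With real annotations, the same function would be fed one (model, score) pair per rated output; how multiple annotators' scores for a single output are aggregated is a further choice this sketch leaves open.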

Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

CV and Pattern Recognition

Helps AI spot dangers on building sites.

9 Jun 2025 1

90%

When Helpers Become Hazards: A Benchmark for Analyzing Multimodal LLM-Powered Safety in Daily Life

Computation and Language

Helps AI avoid showing dangerous things.

7 Jan 2026 1

90%

Survey of Adversarial Robustness in Multimodal Large Language Models

CV and Pattern Recognition

Makes AI understand pictures and words safely.

18 Mar 2025 0

Page Count

22 pages
