Score: 0

Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

Published: December 16, 2025 | arXiv ID: 2512.14860v1

By: Viet K. Nguyen, Mohammad I. Husain

Potential Business Impact:

Finds new ways AI can be tricked.

Business Areas:

Artificial Intelligence Artificial Intelligence, Data and Analytics, Science and Engineering, Software

Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o successfully executes attacks as an agent that it refuses in chat mode, there is no comparative analysis in multiple models and frameworks. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) across two agentic AI frameworks (AutoGen and CrewAI) using a seven-agent architecture that mimics the functionality of a university information management system and 13 distinct attack scenarios that span prompt injection, Server Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model performance ranges from Nova Pro's 46.2% to Claude and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns including a novel "hallucinated compliance" strategy where models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are also included in the Appendix to enable reproducibility.

Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Artificial Intelligence

Helps AI agents work together to solve problems.

13 Aug 2025 0

90%

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Computation and Language

Finds hidden dangers in smart computer helpers.

5 Sep 2025 0

89%

Securing Agentic AI: Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System

Cryptography and Security

Protects smart AI from being tricked or broken.

12 Aug 2025 0

View PDF Login to Bookmark

Country of Origin

🇺🇸 United States

Page Count

13 pages

Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks

Finds new ways AI can be tricked.

Technical Abstract

Agentic AI Frameworks: Architectures, Protocols, and Design Challenges

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Securing Agentic AI: Threat Modeling and Risk Analysis for Network Monitoring Agentic AI System