Score: 2

Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

Published: December 26, 2025 | arXiv ID: 2512.22306v1

By: Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, and more

Potential Business Impact:

Detects multiple coexisting security vulnerabilities in large source files, supporting automated code auditing across C, C++, Python, and JavaScript.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from "count bias" and "selection bias" in multi-label tasks, yet this has not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k-10k tokens) sourced from CodeParrot. We evaluate five state-of-the-art LLMs, including GPT-4o-mini, Llama-3.3-70B, and the Qwen-2.5 series. Our results reveal a sharp degradation in performance as vulnerability density increases. While Llama-3.3-70B achieves near-perfect F1 scores (approximately 0.97) on single-vulnerability C tasks, performance drops by up to 40% in high-density settings. Notably, Python and JavaScript show distinct failure modes compared to C/C++, with models exhibiting severe "under-counting" (Recall dropping to less than 0.30) in complex Python files.
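
To make the reported metrics concrete, below is a minimal sketch (not the authors' released code) of how per-file multi-vulnerability detection could be scored. It assumes each file carries a ground-truth list of injected vulnerability labels (e.g., CWE IDs) and the model returns a predicted list; the helper name score_file and the multiset-overlap matching are illustrative assumptions, not the paper's exact protocol. Under this scoring, the "under-counting" failure mode (fewer predictions than injected bugs) shows up directly as low recall.

    # Hypothetical scoring sketch; labels and matching scheme are assumptions.
    from collections import Counter

    def score_file(predicted: list[str], injected: list[str]) -> dict:
        """Multiset precision/recall/F1 for one file's vulnerability labels."""
        pred, gold = Counter(predicted), Counter(injected)
        true_pos = sum((pred & gold).values())  # overlap, counted with multiplicity
        precision = true_pos / max(sum(pred.values()), 1)
        recall = true_pos / max(sum(gold.values()), 1)
        f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
        return {"precision": precision, "recall": recall, "f1": f1}

    # Hypothetical example: a file injected with 5 bugs where the model reports
    # only 2 of them, illustrating under-counting (recall = 0.4 here).
    print(score_file(
        predicted=["CWE-78", "CWE-89"],
        injected=["CWE-78", "CWE-89", "CWE-79", "CWE-22", "CWE-476"],
    ))

Averaging such per-file scores within each injected-count bucket (1, 3, 5, or 9 vulnerabilities) would reproduce the density-stratified comparison the abstract describes.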

Country of Origin
🇺🇸 🇮🇳 United States, India

Page Count
8 pages

Category
Computer Science:
Cryptography and Security