LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python
By: Akshay Mhatre, Noujoud Nader, Patrick Diehl, and more
Potential Business Impact:
Finds and helps repair simple bugs and some tricky security flaws.
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development, supporting tasks from code generation to debugging. Yet, their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. This highlights both the promise and the present constraints of LLMs in serving as reliable code analysis tools.
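The abstract describes a multi-stage, context-aware prompting protocol and a graded rubric, but not the exact prompts or weights. Below is a minimal sketch of what such a protocol might look like; the query_llm placeholder, the stage templates, and the rubric weights are all illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of a multi-stage, context-aware prompting loop for
# bug detection. All names, prompts, and weights here are assumptions
# for illustration, not taken from the paper.

STAGES = [
    # Stage 1: open-ended review, as a first-pass auditor would begin.
    "Review this code and list any bugs or vulnerabilities:\n{code}",
    # Stage 2: narrow the focus to security-relevant flaws.
    "Focus on security issues (buffer overflows, injection, unsafe input handling) in:\n{code}",
    # Stage 3: demand root-cause reasoning and a concrete patch.
    "For each issue, explain the root cause and propose a fix:\n{code}",
]

def query_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT-4, Claude 3, or LLaMA 4."""
    raise NotImplementedError

def run_protocol(code: str) -> list[str]:
    """Run the stages in order, feeding each answer back as context."""
    transcript, context = [], ""
    for template in STAGES:
        answer = query_llm(context + template.format(code=code))
        transcript.append(answer)
        context += f"Previous analysis:\n{answer}\n\n"
    return transcript

def rubric_score(detected: bool, reasoning_depth: int, patch_quality: int) -> float:
    """Graded rubric over detection, reasoning depth (0-5), and
    remediation quality (0-5); the weights are assumed, not the paper's."""
    return 0.5 * detected + 0.3 * reasoning_depth / 5 + 0.2 * patch_quality / 5

Under a rubric of this shape, a model that flags the defect, explains why it occurs, and produces a working patch earns full marks, while one that merely flags it earns partial credit, which matches the paper's separation of detection accuracy, reasoning depth, and remediation quality.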
Similar Papers
On the Evaluation of Large Language Models in Multilingual Vulnerability Repair
Software Engineering
Fixes code vulnerabilities in many programming languages.
From Code Foundation Models to Agents and Applications: A Comprehensive Survey and Practical Guide to Code Intelligence
Software Engineering
Helps computers write programs from natural-language descriptions.
Large Language Models for Multilingual Vulnerability Detection: How Far Are We?
Software Engineering
Finds hidden vulnerabilities in many programming languages.