LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code
By: Liwei Guo, Sixiang Ye, Zeyu Sun, and more
Potential Business Impact:
AI coding assistants often repeat old bugs when completing bug-prone code.
Large Language Models (LLMs) have demonstrated remarkable performance in code completion. However, the training data used to develop these models often contain a significant amount of buggy code. Yet, it remains unclear to what extent these buggy instances influence LLMs' performance when tackling bug-prone code completion tasks. To fill this gap, this paper presents the first empirical study evaluating the performance of LLMs in completing bug-prone code. Through extensive experiments on 7 LLMs and the Defects4J dataset, we analyze LLMs' accuracy, robustness, and limitations in this challenging context. Our experimental results show that completing bug-prone code is significantly more challenging for LLMs than completing normal code. Notably, in bug-prone tasks, the likelihood of LLMs generating correct code is nearly the same as generating buggy code, and it is substantially lower than in normal code completion tasks (e.g., 12.27% vs. 29.85% for GPT-4). To our surprise, 44.44% of the bugs LLMs make are completely identical to the pre-fix version, indicating that LLMs have been seriously biased by historical bugs when completing code. Additionally, we investigate the effectiveness of existing post-processing techniques and find that while they can improve consistency, they do not significantly reduce error rates in bug-prone code scenarios. Our research highlights the limitations of current LLMs in handling bug-prone code and underscores the need for improved models and post-processing strategies to enhance code completion accuracy in real-world development environments.
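To make the core measurement concrete, below is a minimal sketch (not the authors' code) of how a completion for a bug-prone location could be classified as correct, identical to the historical bug, or otherwise wrong, by comparing it against the fixed and pre-fix versions of the line from a Defects4J-style pair. The function names, normalization, and example data are assumptions for illustration only.

```python
# Hypothetical sketch: label an LLM completion by comparing it against the
# fixed line and the pre-fix (buggy) line of a Defects4J-style bug location.

def normalize(code: str) -> str:
    """Collapse whitespace so cosmetic differences don't affect the comparison."""
    return " ".join(code.split())

def classify_completion(completion: str, fixed_line: str, buggy_line: str) -> str:
    """Return 'correct', 'identical_to_historical_bug', or 'other_incorrect'."""
    completion_n = normalize(completion)
    if completion_n == normalize(fixed_line):
        return "correct"
    if completion_n == normalize(buggy_line):
        return "identical_to_historical_bug"
    return "other_incorrect"

# Toy example (hypothetical data, not taken from Defects4J):
fixed = "if (index >= 0 && index < size) {"
buggy = "if (index > 0 && index < size) {"
print(classify_completion("if (index > 0 && index < size) {", fixed, buggy))
# -> "identical_to_historical_bug"
```

Aggregating such labels over many bug-prone locations is one straightforward way to arrive at statistics like the share of buggy completions that exactly reproduce the pre-fix code.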
Similar Papers
Large Language Models for Fault Localization: An Empirical Study
Software Engineering
Finds bugs in computer code faster.
An Empirical Study on the Capability of LLMs in Decomposing Bug Reports
Software Engineering
Helps computers break down bug reports faster.
An Empirical Study of LLM-Based Code Clone Detection
Software Engineering
Helps computers find similar code, but not always.