On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems
By: Alexander Berndt , Thomas Bach , Rainer Gemulla and more
Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, its prevalence and underlying causes are unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. Our results suggest that generated tests have a slightly higher proportion of flaky tests compared to existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was the reliance of a test on a certain order that is not guaranteed ("unordered collection"), which was present in 72 of 115 flaky tests (63%). Furthermore, both LLMs transferred the flakiness from the existing tests to the newly generated tests via the provided prompt context. Our experiments suggest that flakiness transfer is more prevalent in closed-source systems such as SAP HANA than in open-source systems. Our study informs developers on what types of flakiness to expect from LLM-generated tests. It also highlights the importance of providing LLMs with tailored context when employing LLMs for test generation.
Similar Papers
Understanding LLM-Driven Test Oracle Generation
Software Engineering
AI finds bugs in computer programs automatically.
Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead
Software Engineering
Helps computers write better code tests automatically.
Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics
Software Engineering
Makes computer code better, not just working.