On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems

Published: January 13, 2026 | arXiv ID: 2601.08998v1

By: Alexander Berndt, Thomas Bach, Rainer Gemulla and more

Flaky tests are a common problem in software testing. They produce inconsistent results when executed multiple times on the same code, invalidating the assumption that a test failure indicates a software defect. Recent work on LLM-based test generation has identified flakiness as a potential problem with generated tests. However, its prevalence and underlying causes are unclear. We examined the flakiness of LLM-generated tests in the context of four relational database management systems: SAP HANA, DuckDB, MySQL, and SQLite. We amplified test suites with two LLMs, GPT-4o and Mistral-Large-Instruct-2407, to assess the flakiness of the generated test cases. Our results suggest that generated tests have a slightly higher proportion of flaky tests than existing tests. Based on a manual inspection, we found that the most common root cause of flakiness was a test's reliance on an ordering that is not guaranteed ("unordered collection"), which was present in 72 of 115 flaky tests (63%). Furthermore, both LLMs transferred flakiness from the existing tests to the newly generated tests via the provided prompt context. Our experiments suggest that flakiness transfer is more prevalent in closed-source systems such as SAP HANA than in open-source systems. Our study informs developers about what types of flakiness to expect from LLM-generated tests. It also highlights the importance of providing LLMs with tailored context when using them for test generation.
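
To make the "unordered collection" root cause concrete, here is a minimal sketch in Python against an in-memory SQLite database. It is not taken from the paper: the `users` table, its rows, and the helpers `make_db`, `fetch_names`, and `same_rows` are hypothetical names chosen for illustration. The sketch assumes only the standard SQL rule that a `SELECT` without `ORDER BY` has no guaranteed row order, and shows two common ways a test can avoid depending on that order.

```python
import sqlite3
from collections import Counter


def make_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("carol", 1), ("alice", 1), ("bob", 0)],
    )
    return conn


def fetch_names(conn, ordered=False):
    """Return names of active users; row order is only defined if ordered=True."""
    sql = "SELECT name FROM users WHERE active = 1"
    if ordered:
        sql += " ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


def same_rows(actual, expected):
    """Compare two result lists as multisets, ignoring row order."""
    return Counter(actual) == Counter(expected)


conn = make_db()

# Fragile pattern (left commented out): it passes only if the engine happens
# to return rows in this order, which SQL does not guarantee.
# assert fetch_names(conn) == ["carol", "alice"]

# Option 1: keep the query as is and compare order-insensitively.
assert same_rows(fetch_names(conn), ["alice", "carol"])

# Option 2: make the order part of the query, then compare exactly.
assert fetch_names(conn, ordered=True) == ["alice", "carol"]
```

Option 1 fits tests where row order is irrelevant to the behavior under test; Option 2 fits tests where a specific order is part of the expected result.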

Understanding LLM-Driven Test Oracle Generation

Software Engineering

AI finds bugs in computer programs automatically.

9 Jan 2026 1

87%

Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead

Software Engineering

Helps computers write better code tests automatically.

26 Nov 2025 2

87%

Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics

Software Engineering

Makes computer code better, not just working.

13 Nov 2025 1

View PDF Login to Bookmark

On the Flakiness of LLM-Generated Tests for Industrial and Open-Source Database Management Systems

Technical Abstract

Understanding LLM-Driven Test Oracle Generation

Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead

Quality Assurance of LLM-generated Code: Addressing Non-Functional Quality Characteristics