Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework
By: Irtaza Sajid Qureshi, Zhen Ming Jiang
Potential Business Impact:
Helps computers write better code tests.
Large Language Models (LLMs) are increasingly applied to automated software testing, yet their ability to generalize beyond memorized patterns and reason about natural language bug reports remains unclear. We present a systematic evaluation of LLM reasoning in test case generation, structured around the cognitive layers of Bloom's taxonomy: Remember, Understand, Apply, Analyze, Evaluate, and Create, which progressively assess higher levels of cognitive and reasoning capability. Building on the LIBRO framework, we evaluate StarCoder and GPT-4o on Defects4J, GHRB, and mutated variants that introduce linguistic and semantic challenges. Our findings show that both models largely reproduce prior results with minor deviations (Remember) and exhibit partial robustness to linguistic rephrasings and translations while uncovering unique reproducible bugs (Understand), but suffer severe performance drops exceeding 60% under identifier mutations (Apply). Conversely, providing near-identical few-shot examples in an open-book setting improves success rates by up to three times, and component-level analysis reveals that structured technical elements, such as test code and method names, are far more impactful than narrative descriptions for successful test generation (Analyze). These insights illuminate the cognitive processes underlying LLM-generated tests, suggest concrete directions for improving performance, and establish a robust and realistic evaluation paradigm for this task.
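To make the identifier-mutation idea concrete, the sketch below shows one way such a mutated bug-report variant could be produced: technical identifiers (class and method names) are renamed to opaque aliases while the natural-language content is left untouched, so a model cannot fall back on memorized name-to-test associations. This is a minimal illustration under assumed details; the function and alias mapping are hypothetical, not the paper's actual mutation tooling.

```python
import re


def mutate_identifiers(report: str, mapping: dict[str, str]) -> str:
    """Replace code identifiers in a bug report with opaque aliases.

    The report's narrative text is preserved; only the structured
    technical tokens listed in `mapping` are renamed. (Hypothetical
    helper illustrating the identifier-mutation setup.)
    """
    out = report
    for original, alias in mapping.items():
        # \b word boundaries avoid rewriting substrings of longer identifiers
        out = re.sub(rf"\b{re.escape(original)}\b", alias, out)
    return out


report = "JsonParser.parse() throws NullPointerException on empty input."
aliases = {"JsonParser": "ClassA", "parse": "methodB"}
print(mutate_identifiers(report, aliases))
# → ClassA.methodB() throws NullPointerException on empty input.
```

A large drop in test-generation success on such mutated reports, as the abstract describes for the Apply layer, suggests the model was leaning on memorized identifiers rather than reasoning from the report's description.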
Similar Papers
Automatic High-Level Test Case Generation using Large Language Models
Software Engineering
Helps computers write tests that match what businesses want.
Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
Software Engineering
AI helps teachers grade student code better.
Acceptance Test Generation with Large Language Models: An Industrial Case Study
Software Engineering
Helps make sure websites work right automatically.