Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
By: E. G. Santana Jr., Jander Pereira Santos Junior, Erlon P. Almeida, and more
Potential Business Impact:
Fixes messy computer tests automatically.
Test smells indicate poor development practices in test code, reducing maintainability and reliability. While developers often struggle to prevent or refactor these issues, existing tools focus primarily on detection rather than automated refactoring. Large Language Models (LLMs) have shown strong potential in code understanding and transformation, but their ability to both identify and refactor test smells remains underexplored. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites, using PyNose and TsDetect for initial smell detection, followed by LLM-driven refactoring. Gemini achieved the highest detection accuracy (74.35% for Python, 80.32% for Java), while LLaMA 3 scored lowest. All models could refactor smells, but effectiveness varied and refactorings sometimes introduced new smells. Gemini also improved test coverage, unlike GPT-4 and LLaMA, which often reduced it. These results highlight LLMs' potential for automated test smell refactoring, with Gemini as the strongest performer, though challenges remain across languages and smell types.
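The summary above does not include code from the study, so the snippet below is only an illustrative sketch: a hypothetical Python test exhibiting one common smell (conditional logic inside a test), of the kind detectors such as PyNose flag, followed by one way an LLM-driven refactoring might clean it up. The example, function names, and test cases are invented for illustration and are not taken from the paper.

# Illustrative sketch only, not from the paper. Shows a "Conditional Test Logic"
# smell and a possible refactoring of it. All names here are hypothetical.
import unittest


def is_even(n: int) -> bool:
    return n % 2 == 0


class SmellyTest(unittest.TestCase):
    # Smell: branching inside the test hides which inputs and expectations
    # are actually being checked.
    def test_is_even(self):
        for n in [0, 2, 3, 7]:
            if n % 2 == 0:
                self.assertTrue(is_even(n))
            else:
                self.assertFalse(is_even(n))


class RefactoredTest(unittest.TestCase):
    # Refactoring: explicit expected values plus subTest make each case visible
    # and independently reported, removing the conditional logic.
    def test_is_even(self):
        cases = [(0, True), (2, True), (3, False), (7, False)]
        for n, expected in cases:
            with self.subTest(n=n):
                self.assertEqual(is_even(n), expected)


if __name__ == "__main__":
    unittest.main()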
Similar Papers
Agentic LMs: Hunting Down Test Smells
Software Engineering
Fixes bad code automatically to make programs better.
Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset
Software Engineering
Cleans computer code to make programs better.
Investigating The Smells of LLM Generated Code
Software Engineering
Finds bad code written by AI.