Agentic LMs: Hunting Down Test Smells
By: Rian Melo, Pedro Simões, Rohit Gheyi, and more
Potential Business Impact:
Automatically finds and fixes problems in test code so software stays reliable.
Test smells reduce test suite reliability and complicate maintenance. While many methods detect test smells, few support automated removal, and most rely on static analysis or machine learning. This study evaluates models with relatively small parameter counts (Llama-3.2-3B, Gemma-2-9B, DeepSeek-R1-14B, and Phi-4-14B) for their ability to detect and refactor test smells using agent-based workflows. We assess workflows with one, two, and four agents over 150 instances of five common smells from real-world Java projects. Our approach generalizes to Python, Golang, and JavaScript. All models detected nearly all instances, with Phi-4-14B achieving the best refactoring accuracy (pass@5 of 75.3%). With four agents, Phi-4-14B performed within 5% of single-agent proprietary LLMs. Multi-agent setups outperformed single-agent ones on three of five smell types, though for Assertion Roulette a single agent sufficed. We submitted pull requests with Phi-4-14B-generated code to open-source projects, and six were merged.
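For context, Assertion Roulette (one of the five smells studied) occurs when a test method packs several unexplained assertions together, so a failure report does not say which expectation broke. The JUnit 5 sketch below, with hypothetical class and method names not taken from the paper, shows the smell and one common refactoring that adds descriptive messages; splitting the method into one test per expectation is an equally valid fix.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.jupiter.api.Test;

// Minimal class under test so the example compiles (hypothetical, not from the paper).
class ShoppingCart {
    private final Map<String, double[]> items = new HashMap<>();

    void add(String name, int quantity, double unitPrice) {
        items.put(name, new double[] { quantity, unitPrice });
    }

    int itemCount()     { return items.values().stream().mapToInt(v -> (int) v[0]).sum(); }
    double total()      { return items.values().stream().mapToDouble(v -> v[0] * v[1]).sum(); }
    int distinctItems() { return items.size(); }
}

class ShoppingCartTest {

    // Smelly version: Assertion Roulette. Three assertions share one test
    // method with no messages, so a failing run does not indicate which
    // expectation was violated.
    @Test
    void testCart() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("book", 2, 10.0);
        assertEquals(2, cart.itemCount());
        assertEquals(20.0, cart.total());
        assertEquals(1, cart.distinctItems());
    }

    // Refactored version: each assertion carries a descriptive message, so a
    // failure names the exact expectation that broke.
    @Test
    void addingAnItemUpdatesCountTotalAndDistinctItems() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("book", 2, 10.0);
        assertEquals(2, cart.itemCount(), "item count after adding 2 books");
        assertEquals(20.0, cart.total(), "total price after adding 2 books at 10.0 each");
        assertEquals(1, cart.distinctItems(), "number of distinct products in the cart");
    }
}
```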
Similar Papers
Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study
Software Engineering
Fixes messy computer tests automatically.
RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring
Software Engineering
Makes computer code better and faster automatically.
Quality Assessment of Python Tests Generated by Large Language Models
Software Engineering
Makes computers write better test code.