Score: 1

Code Refactoring with LLM: A Comprehensive Evaluation With Few-Shot Settings

Published: November 26, 2025 | arXiv ID: 2511.21788v1

By: Md. Raihan Tapader , Md. Mostafizer Rahman , Ariful Islam Shiplu and more

Potential Business Impact:

Makes computer code cleaner and easier to fix.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

In today's world, the focus of programmers has shifted from writing complex, error-prone code to prioritizing simple, clear, efficient, and sustainable code that makes programs easier to understand. Code refactoring plays a critical role in this transition by improving structural organization and optimizing performance. However, existing refactoring methods are limited in their ability to generalize across multiple programming languages and coding styles, as they often rely on manually crafted transformation rules. The objectives of this study are to (i) develop an Large Language Models (LLMs)-based framework capable of performing accurate and efficient code refactoring across multiple languages (C, C++, C#, Python, Java), (ii) investigate the impact of prompt engineering (Temperature, Different shot algorithm) and instruction fine-tuning on refactoring effectiveness, and (iii) evaluate the quality improvements (Compilability, Correctness, Distance, Similarity, Number of Lines, Token, Character, Cyclomatic Complexity) in refactored code through empirical metrics and human assessment. To accomplish these goals, we propose a fine-tuned prompt-engineering-based model combined with few-shot learning for multilingual code refactoring. Experimental results indicate that Java achieves the highest overall correctness up to 99.99% the 10-shot setting, records the highest average compilability of 94.78% compared to the original source code and maintains high similarity (Approx. 53-54%) and thus demonstrates a strong balance between structural modifications and semantic preservation. Python exhibits the lowest structural distance across all shots (Approx. 277-294) while achieving moderate similarity ( Approx. 44-48%) that indicates consistent and minimally disruptive refactoring.