Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
By: Le Chen, Nuo Xu, Winson Chen, and more
Potential Business Impact:
Makes old computer code work on new systems.
Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran-to-C++ and C++-to-CUDA translation, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show that this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics such as compilation success.
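The heart of the pipeline is a feedback loop: the Solver LLM proposes a translation, an external tool (here a compiler) checks it, and the Questioner LLM turns the diagnostics into the next-turn prompt, with every turn logged as dialogue data. Below is a minimal Python sketch of that loop for the Fortran-to-C++ direction, assuming g++ as the external checker; the function names, prompts, and turn limit are hypothetical placeholders, not the paper's actual implementation.

import os
import subprocess
import tempfile

MAX_TURNS = 4  # assumed bound on dialogue length

def compile_cpp(code: str) -> tuple[bool, str]:
    """Compile a candidate C++ translation with g++; return (ok, diagnostics)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.cpp")
        with open(src, "w") as f:
            f.write(code)
        proc = subprocess.run(
            ["g++", "-std=c++17", "-c", src, "-o", os.path.join(tmp, "candidate.o")],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stderr

def ask_solver(prompt: str) -> str:
    """Placeholder for the Solver LLM call; returns a translation attempt."""
    raise NotImplementedError("wire up your LLM client here")

def ask_questioner(source: str, candidate: str, diagnostics: str) -> str:
    """Placeholder for the Questioner LLM: converts compiler feedback
    into a focused follow-up prompt for the Solver."""
    raise NotImplementedError("wire up your LLM client here")

def translate_with_dialogue(fortran_src: str) -> list[dict]:
    """Run the Questioner-Solver loop, logging each turn as dialogue data."""
    dialogue = []
    prompt = f"Translate this Fortran code to C++:\n{fortran_src}"
    for _ in range(MAX_TURNS):
        candidate = ask_solver(prompt)
        ok, diagnostics = compile_cpp(candidate)
        dialogue.append({"prompt": prompt, "response": candidate, "compiled": ok})
        if ok:
            break  # a fuller pipeline would proceed to unit-test verification
        # Ground the next turn in the compiler's error output.
        prompt = ask_questioner(fortran_src, candidate, diagnostics)
    return dialogue

The logged turns, not just the final code pair, are what make the dataset dialogue-based: each record captures a prompt, a candidate, and the external verdict that drove the next refinement.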
Similar Papers
LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study
Software Engineering
Turns old science code into new code.
Tutoring LLM into a Better CUDA Optimizer
Distributed, Parallel, and Cluster Computing
Helps computers write faster code for tasks.
Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks
Machine Learning (CS)
Teaches computers to code by showing thinking steps.