Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis
By: Felipe Ribeiro Fujita de Mello, Hideyuki Takada
Potential Business Impact:
Makes computer translators much smarter with better word choices.
We investigated the impact of data selection on machine-translation fine-tuning for open LLMs. Using Japanese-English corpora, we compared five selectors (TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection) under controlled training conditions. We observed that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
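The abstract contrasts semantic selectors with lexical heuristics such as TF-IDF. As a rough illustration of the lexical baseline (a sketch, not the paper's actual implementation; the scoring and tokenization here are assumptions), one can score each candidate sentence by the mean TF-IDF weight of its tokens and keep the top-k for fine-tuning:

```python
# Sketch of TF-IDF-based data selection: rank candidate sentences by the
# average TF-IDF weight of their tokens, keep the k highest-scoring ones.
# Whitespace tokenization and mean-of-weights scoring are illustrative
# assumptions, not details taken from the paper.
import math
from collections import Counter

def tfidf_select(sentences, k):
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each token appears.
    df = Counter()
    for toks in docs:
        df.update(set(toks))

    def score(toks):
        tf = Counter(toks)
        # Mean TF-IDF over the sentence's unique tokens.
        return sum(
            (tf[t] / len(toks)) * math.log(n / df[t]) for t in tf
        ) / len(tf)

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [sentences[i] for i in ranked[:k]]

pool = [
    "the cat sat",
    "the dog sat",
    "quantum entanglement research",
]
selected = tfidf_select(pool, 1)
```

A semantic selector such as COMET Kiwi would instead score each source-translation pair with a learned quality-estimation model, which is what the comparison in the paper turns on.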
Similar Papers
Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Computation and Language
Makes computer translations much better and faster.
Exploring Parameter-Efficient Fine-Tuning and Backtranslation for the WMT 25 General Translation Task
Computation and Language
Improves Japanese to English translation quality.
Improving LLMs for Machine Translation Using Synthetic Preference Data
Computation and Language
Makes computer translations much better and more accurate.