AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts
By: Baorong Huang, Ali Asiri
Potential Business Impact:
Helps computers translate hard Arabic and English texts.
High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce, and existing datasets consist mainly of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising complex legal and literary texts. Our evaluation demonstrates that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our "Hard" subset, we expose the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrate superior robustness, achieving an overall F1-score of 85.5%, a 9% improvement over previous methods. Our datasets and code are open-sourced at https://github.com/XXX.
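To make the F1 figure concrete, here is a minimal sketch of how a sentence-alignment F1-score is commonly computed. It assumes strict matching, where a predicted alignment counts as correct only if it is identical to a gold alignment. The abstract does not state the exact scoring protocol, so the `Bead` type, the `alignment_f1` function, and the toy data are all hypothetical illustrations, not the authors' evaluation code.

```python
from typing import Iterable, Tuple

# Hypothetical representation: an alignment "bead" pairs a tuple of source
# sentence indices with a tuple of target sentence indices.
# ((1, 2), (1,)) is a two-to-one alignment; ((2,), ()) is an unaligned source sentence.
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]


def alignment_f1(predicted: Iterable[Bead], gold: Iterable[Bead]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict bead matching (an assumption)."""
    pred_set = set(predicted)
    gold_set = set(gold)
    correct = len(pred_set & gold_set)  # beads that match the gold standard exactly

    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: the gold standard contains one two-to-one and one one-to-two bead,
# the kind of non-one-to-one mapping a "Hard" subset would emphasize.
gold = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2, 3))]
# A system that assumes one-to-one mappings splits the two-to-one bead incorrectly.
pred = [((0,), (0,)), ((1,), (1,)), ((2,), ()), ((3,), (2, 3))]

precision, recall, f1 = alignment_f1(pred, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy case the single mishandled two-to-one bead produces two wrong predictions and one missed gold bead, which illustrates why a test set dominated by one-to-one mappings would hide this kind of error.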