ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations
By: Rania Al-Sabbagh
Potential Business Impact:
Helps computers translate Egyptian Arabic songs and stories.
ArzEn-MultiGenre is a parallel dataset of Egyptian Arabic song lyrics, novels, and TV show subtitles that have been manually translated and aligned with their English counterparts. The dataset contains 25,557 segment pairs that can be used to benchmark new machine translation models, fine-tune large language models in few-shot settings, and adapt commercial machine translation applications such as Google Translate. The dataset is also a resource for research in translation studies, cross-linguistic analysis, and lexical semantics. It can serve pedagogical purposes in the training of translation students, and it can aid professional translators as a translation memory. The contributions are twofold: first, the dataset features textual genres not found in existing parallel Egyptian Arabic and English datasets; second, it is a gold-standard dataset that has been translated and aligned by human experts.
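As a rough illustration of the translation-memory use case, the sketch below looks up the closest stored source segment for a new Egyptian Arabic sentence and returns its English counterpart. The abstract does not describe the dataset's file format, so the in-memory `SEGMENT_PAIRS` list, the `lookup` function, and the 0.6 similarity threshold are all assumptions made for this example. The sample pairs are invented and are not taken from the dataset.

```python
from difflib import SequenceMatcher

# Hypothetical segment pairs, invented for illustration only.
# They are NOT taken from ArzEn-MultiGenre.
SEGMENT_PAIRS = [
    ("إزيك عامل إيه؟", "How are you doing?"),
    ("أنا مش فاهم حاجة", "I don't understand anything"),
    ("يلا بينا نروح", "Come on, let's go"),
]


def lookup(query, pairs, threshold=0.6):
    """Return (score, source, target) for the most similar source segment.

    Similarity is the character-level ratio from difflib.SequenceMatcher.
    Returns None when the best score is below the threshold (an assumed
    cut-off; a real translation memory would tune this value).
    """
    best = None
    for source, target in pairs:
        score = SequenceMatcher(None, query, source).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None


if __name__ == "__main__":
    match = lookup("يلا بينا نروح", SEGMENT_PAIRS)
    if match is not None:
        score, source, target = match
        print(f"{score:.2f}\t{source}\t{target}")
```

Character-level matching is used here only because it needs no external dependencies; a production translation memory would more likely match on tokens or embeddings.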