PrahokBART: A Pre-trained Sequence-to-Sequence Model for Khmer Natural Language Generation
By: Hour Kaing, Raj Dabre, Haiyue Song, and more
Potential Business Impact:
Helps computers understand and write Khmer text.
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving pre-training corpus quality and on addressing linguistic issues of Khmer that existing multilingual models ignore, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles spaces during text generation, which is crucial for the naturalness of Khmer text.
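To make the preprocessing idea concrete, below is a minimal Python sketch of the kind of Khmer-specific normalization and word segmentation the abstract describes. It is not the paper's actual pipeline: normalize_khmer only applies Unicode NFC and cleans up zero-width spaces (U+200B, commonly used to mark Khmer word boundaries), and segment_words is a naive stand-in for a trained Khmer segmenter.

```python
# A sketch of Khmer preprocessing (normalization + segmentation) of the
# sort PrahokBART's pre-training corpus work involves. The functions and
# names here are illustrative assumptions, not the paper's components.
import unicodedata

ZWSP = "\u200b"  # zero-width space, often marks Khmer word breaks


def normalize_khmer(text: str) -> str:
    """Apply canonical Unicode normalization and tidy zero-width spaces."""
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of zero-width spaces into a single one.
    while ZWSP + ZWSP in text:
        text = text.replace(ZWSP + ZWSP, ZWSP)
    return text.strip(ZWSP).strip()


def segment_words(text: str) -> list[str]:
    """Placeholder segmentation: split on ZWSP and ordinary spaces.
    A real system would use a trained Khmer word segmenter instead."""
    return [tok for tok in text.replace(" ", ZWSP).split(ZWSP) if tok]


if __name__ == "__main__":
    # "Hello" + doubled ZWSP + "I" in Khmer script.
    raw = "\u179f\u17bd\u179f\u17d2\u178f\u17b8" + ZWSP + ZWSP + "\u1781\u17d2\u1789\u17bb\u17c6"
    clean = normalize_khmer(raw)
    print(segment_words(clean))  # two Khmer tokens
```

Because Khmer does not delimit words with visible spaces, this segmentation step determines the token boundaries the model learns, which is why the paper also evaluates how well the model reinserts spaces at generation time.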
Similar Papers
State-of-the-Art Translation of Text-to-Gloss using mBART: A case study of Bangla
Computation and Language
Helps computers translate written words to sign language.
Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model
Computation and Language
Helps translate between Bahnaric and Vietnamese.
Can maiBERT Speak for Maithili?
Computation and Language
Helps computers understand the Maithili language.