A fully automated and scalable Parallel Data Augmentation for Low Resource Languages using Image and Text Analytics
By: Prawaal Sharma, Navneet Goyal, Poonam Goyal and more
Potential Business Impact:
Helps computers understand many languages better.
Linguistic diversity across the world creates a disparity in the availability of good-quality digital language resources, thereby restricting the technological benefits available to the majority of the human population. The scarcity or absence of data resources makes it difficult to perform NLP tasks for low-resource languages. This paper presents a novel, scalable, and fully automated methodology to extract bilingual parallel corpora from newspaper articles using image and text analytics. We validate our approach by building a parallel data corpus for two different language combinations and demonstrate the value of this dataset through a downstream machine translation task, improving over the current baseline by close to 3 BLEU points.
Similar Papers
Parallel Corpora for Machine Translation in Low-resource Indic Languages: A Comprehensive Review
Computation and Language
Helps computers translate Indian languages better.
UPRPRC: Unified Pipeline for Reproducing Parallel Resources -- Corpus from the United Nations
Computation and Language
Makes computer translators understand more languages better.
From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation
Computation and Language
Improves translation for African languages.