No Language Left Behind: Scaling a Machine Translation Pipeline to Low-Resource Languages
Progress in machine translation has coalesced around a small subset of languages, leaving behind the vast majority, most of them low-resource. In No Language Left Behind, we took on the challenge of breaking the 200-language barrier, scaling not only engineering and modeling but also evaluation. Our model improves BLEU by 44% relative to the previous state of the art, laying the groundwork for universal translation. This presentation will first give an overview of our approach. It will then dive deep into our automated data mining system, which was key to identifying precious training data for low-resource languages.
Alexandre Mourachko is a Research Engineering Manager at Meta AI. He currently leads research and engineering efforts for the No Language Left Behind project, which aims to break language barriers across the world by scaling machine translation to more low-resource languages. Within this highly multidisciplinary endeavor, his research team focuses on bringing sentence representation learning to the next level, scaling data mining pipelines to web corpora, and mitigating toxicity and bias in translation model outputs. Before joining Meta, Alexandre led machine learning research teams fighting toxic behaviors in multilingual online gaming communities.