Facebook AI is introducing M2M-100, the first multilingual machine translation (MMT) model that can translate between any pair of 100 languages without relying on English data.

When translating, say, Chinese to French, most English-centric multilingual models train on Chinese-to-English and English-to-French data, because English training data is the most widely available. Our model trains directly on Chinese-to-French data to better preserve meaning, and it outperforms English-centric systems by 10 points on BLEU, the most widely used metric for evaluating machine translations.

M2M-100 is trained on a total of 2,200 language directions, 10x more than the best previous English-centric multilingual models. Deploying M2M-100 will improve the quality of translations for billions of people, especially those who speak low-resource languages.

This milestone is a culmination of years of Facebook AI's foundational work in machine translation. Today, we're sharing details on how we built a more diverse MMT training data set and model for 100 languages. We're also releasing the model, training, and evaluation setup to help other researchers reproduce and further advance multilingual models.
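To make the "no English pivot" claim concrete, here is a minimal sketch of translating Chinese directly into French with a released M2M-100 checkpoint. It assumes the Hugging Face Transformers port of the model (the facebook/m2m100_418M checkpoint on the Hugging Face Hub) rather than the original fairseq release, with the transformers and sentencepiece packages installed:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# A small public M2M-100 checkpoint hosted on the Hugging Face Hub.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

chinese_text = "生活就像一盒巧克力。"  # "Life is like a box of chocolates."

# Translate Chinese -> French in one step, with no English pivot:
# the source language is set on the tokenizer, and the target language
# is forced via the first generated token.
tokenizer.src_lang = "zh"
encoded = tokenizer(chinese_text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("fr"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
# Prints the French translation directly, no intermediate English text.
```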
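The 10-point improvement above is measured in BLEU, which scores n-gram overlap between a system's output and reference translations. As a quick illustration (the post doesn't name any tooling, so the sacrebleu library here is our assumption), a corpus-level score can be computed like this:

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical system outputs and reference translations.
hypotheses = ["La vie est comme une boîte de chocolat."]
references = [["La vie est comme une boîte de chocolats."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```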
Breaking language barriers through machine translation (MT) is one of the most important ways to bring people together, provide authoritative information on COVID-19, and keep people safe from harmful content. Today, we power an average of 20 billion translations every day on Facebook News Feed, thanks to our recent developments in low-resource machine translation and recent advances for evaluating translation quality.

Typical MT systems require building a separate AI model for each language and each task, but this approach doesn't scale effectively on Facebook, where people post content in more than 160 languages across billions of posts. Advanced multilingual systems can process multiple languages at once, but they compromise on accuracy by relying on English data to bridge the gap between the source and target languages. To better serve our community, nearly two-thirds of which uses a language other than English, we need one multilingual machine translation (MMT) model that can translate any language.

In a culmination of many years of MT research at Facebook, we're excited to announce a major milestone: the first single massively multilingual MT model that can directly translate between any pair of 100 languages, in any direction, without relying only on English-centric data. Our single multilingual model performs as well as traditional bilingual models and achieved a 10 BLEU point improvement over English-centric multilingual models. Using novel mining strategies to create translation data, we built the first truly "many-to-many" data set, with 7.5 billion sentences for 100 languages. We then used several scaling techniques to build a universal model with 15 billion parameters, one that captures information from related languages and reflects a more diverse set of scripts and morphologies.

Mining Hundreds of Millions of Sentences for Thousands of Language Directions

One of the biggest hurdles of building a many-to-many MMT model is curating large volumes of high-quality sentence pairs (also known as parallel sentences) for arbitrary translation directions that don't involve English. It is far easier to find translations from Chinese to English and from English to French than from, say, French to Chinese. What's more, the volume of data required for training grows quadratically with the number of languages we support: if we need 10M sentence pairs for each direction, then we need to mine on the order of 1B sentence pairs for 10 languages and 100B sentence pairs for 100 languages.
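To see where those totals come from: n languages give n * (n - 1) directed translation directions, so the mining requirement grows roughly with the square of the language count. A quick back-of-the-envelope check (the post rounds the 10-language figure of 0.9B up to 1B):

```python
# Back-of-the-envelope check of the quadratic data requirement described
# above: n languages -> n * (n - 1) directed translation directions.
PAIRS_PER_DIRECTION = 10_000_000  # 10M mined sentence pairs per direction

for n_languages in (10, 100):
    directions = n_languages * (n_languages - 1)
    total_pairs = directions * PAIRS_PER_DIRECTION
    print(f"{n_languages} languages: {directions} directions, "
          f"{total_pairs / 1e9:.1f}B sentence pairs")
# 10 languages:  90 directions  -> 0.9B sentence pairs
# 100 languages: 9900 directions -> 99.0B sentence pairs
```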
We took on the ambitious challenge of building the most diverse many-to-many MMT data set to date: 7.5 billion sentence pairs across 100 languages. This was possible only by combining complementary data mining resources that have been years in the making, including ccAligned, ccMatrix, and LASER. As part of this effort, we created LASER 2.0 and improved fastText language identification, which improve the quality of mining and include open-sourced training and evaluation scripts. All of our data mining resources leverage publicly available data and are open sourced.

[Timeline image: highlights a few noteworthy achievements in this data mining work.]
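Resources like ccMatrix and LASER find parallel sentences by embedding text from many languages into a shared space and keeping candidate pairs whose similarity stands out against each sentence's nearest neighbours. The sketch below is a toy illustration of that margin-based scoring idea, not the released pipeline: the embeddings are random placeholders (in practice they would come from a LASER encoder), and production mining uses approximate nearest-neighbour search, for example with FAISS, over billions of sentences.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scoring in the spirit of LASER/ccMatrix mining: a pair
    scores highly when its cosine similarity stands out relative to the
    average similarity of each sentence's k nearest neighbours."""
    # Normalize rows so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # pairwise cosine similarity, shape (n_src, n_tgt)

    # Mean similarity of each sentence's k nearest neighbours, used to
    # normalize away "hub" sentences that are close to everything.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return sim / ((knn_src + knn_tgt) / 2)

# Toy usage with placeholder embeddings (real mining embeds real sentences).
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(8, 1024))  # e.g., 8 French sentence embeddings
tgt_emb = rng.normal(size=(8, 1024))  # e.g., 8 Chinese sentence embeddings
scores = margin_scores(src_emb, tgt_emb)
best_match = scores.argmax(axis=1)  # best candidate pair per source sentence
```

In practice, only candidate pairs whose margin score clears a tuned threshold are kept as mined bitext.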