Amazon Web Services Inc.’s artificial intelligence research team today announced the release of a massive new dataset, appropriately called “MASSIVE,” which it says can be used to build virtual assistants that support some of the world’s most obscure languages.
Alongside the dataset, Amazon has also released open-source modeling code to help developers build more capable virtual assistants.
The MASSIVE database is what’s known as a “parallel dataset,” meaning that each utterance within it is given in all 51 languages it supports, including many obscure ones for which little labeled data exists to train AI models.
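To make the idea of a parallel dataset concrete, here is a minimal Python sketch that groups translations of the same utterance under a shared identifier. The record layout, the field names (`id`, `locale`, `utt`, `intent`) and the sample sentences are assumptions made for illustration, not a description of the files Amazon published.

```python
from collections import defaultdict

# Hypothetical records: field names and sample sentences are assumed for illustration.
records = [
    {"id": 1, "locale": "en-US", "utt": "wake me up at nine am", "intent": "alarm_set"},
    {"id": 1, "locale": "de-DE", "utt": "weck mich um neun uhr", "intent": "alarm_set"},
    {"id": 2, "locale": "en-US", "utt": "play some jazz", "intent": "play_music"},
    {"id": 2, "locale": "de-DE", "utt": "spiel etwas jazz", "intent": "play_music"},
]


def group_parallel(rows):
    """Map each utterance id to a {locale: text} dictionary of its translations."""
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["id"]][row["locale"]] = row["utt"]
    return dict(grouped)


parallel = group_parallel(records)
print(parallel[1]["de-DE"])
```

In a parallel dataset every identifier would carry an entry for every supported language, so the same lookup works for any language pair.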
The idea is that developers can use the MASSIVE database to train AI models to understand those more obscure languages about as well as they understand more common languages such as English.
The approach is known as massively multilingual natural language understanding, a paradigm that allows AI models to parse and understand inputs from many typologically diverse languages. By learning shared data representations that span multiple languages, AI models can transfer knowledge from languages where training data is abundant to those in which data is scarce, Amazon explained.
Amazon said the MASSIVE database will be particularly useful in advancing spoken-language understanding, where audio is converted to text before NLU is performed. Virtual assistants like Amazon Alexa commonly use SLU to understand a user’s commands, but they only support a small fraction of the world’s 7,000-plus languages because of a lack of training data.
It’s hoped that MASSIVE, which more or less stands for Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation, can overcome this scarcity of data. The dataset contains 1 million realistic, parallel, labeled virtual assistant text utterances that span 51 languages, 18 domains, 60 intents and 55 slots. It was created by professional translators who were tasked with translating or localizing the English-language dataset into 50 typologically diverse languages from 29 genera, including many low-resource languages.
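For readers unfamiliar with intent classification and slot filling, the following Python sketch shows what one labeled utterance might look like and how its slots could be extracted. It assumes a bracketed annotation style of the form `[slot : value]`; the field names, the intent label and the sample sentence are illustrative assumptions rather than a confirmed description of the dataset's format.

```python
import re

# Assumed annotation style: "[slot_name : slot value]" embedded in the utterance.
SLOT_PATTERN = re.compile(r"\[\s*(\w+)\s*:\s*([^\]]+?)\s*\]")


def parse_annotated(annotated):
    """Return the plain utterance and a list of (slot_name, slot_value) pairs."""
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annotated)]
    # Replace each bracketed annotation with its bare value to recover the plain text.
    plain = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return plain, slots


# Hypothetical labeled example: one intent for the whole utterance, plus slots.
example = {
    "intent": "alarm_set",
    "annot_utt": "wake me up at [time : nine am] on [date : friday]",
}

text, slots = parse_annotated(example["annot_utt"])
print(example["intent"], text, slots)
```

The intent labels the command as a whole, while each slot marks a span of the utterance that carries a parameter of that command.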
Amazon said the MASSIVE dataset and tools for using it are all available from its GitHub repository starting today. In addition to launching the dataset, it has also created a competition to encourage developers to work with it. The Massively Multilingual NLU 2022 competition is hosted on eval.ai and is composed of two tasks.
The first task, MMNLU-22-Full, invites developers to train and test a single AI model on all 51 languages in the MASSIVE dataset. Having done that, developers can attempt the second task, MMNLU-22-ZeroShot, which involves fine-tuning a pretrained model only with English-labeled data and then testing it on all 50 non-English languages in MASSIVE.
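The zero-shot protocol described above can be sketched in a few lines of Python: keep only English rows for fine-tuning, hold out every other language for testing, and score each held-out language separately. The record fields, locale codes and the placeholder predictor are all assumptions for illustration. A real entry would substitute a fine-tuned multilingual model, and the competition's own scoring may differ.

```python
from collections import defaultdict

# Hypothetical labeled rows; field names and locale codes are assumptions.
rows = [
    {"locale": "en-US", "utt": "set an alarm", "intent": "alarm_set"},
    {"locale": "en-US", "utt": "play jazz", "intent": "play_music"},
    {"locale": "de-DE", "utt": "stell einen wecker", "intent": "alarm_set"},
    {"locale": "sw-KE", "utt": "cheza muziki", "intent": "play_music"},
]


def zero_shot_split(data, source_locale="en-US"):
    """Fine-tuning data is the source language only; test data is everything else."""
    train = [r for r in data if r["locale"] == source_locale]
    test = [r for r in data if r["locale"] != source_locale]
    return train, test


def accuracy_by_locale(test_rows, predict):
    """Compute intent accuracy separately for each held-out language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in test_rows:
        total[r["locale"]] += 1
        correct[r["locale"]] += int(predict(r["utt"]) == r["intent"])
    return {loc: correct[loc] / total[loc] for loc in total}


train_rows, test_rows = zero_shot_split(rows)


def always_alarm(utterance):
    """Placeholder predictor standing in for a fine-tuned multilingual model."""
    return "alarm_set"


scores = accuracy_by_locale(test_rows, always_alarm)
print(scores)
```

Reporting accuracy per language rather than as a single average makes it easier to see which languages benefit least from the transfer.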
“This assesses the model’s ability to generalize to new languages, an important consideration given the number of languages around the world for which there is little-to-no labeled data,” Amazon’s AI team wrote in a blog post. “Zero-shot learning is a key technology for scaling NLU technology to many more low-resource languages worldwide.”
Amazon has launched a MASSIVE leaderboard to keep track of participants in the competition, which runs until Aug. 8. Winners will then be invited to give an oral presentation of their work, either in person or virtually, at the EMNLP 2022 conference that takes place in Abu Dhabi in December.