Natural language is constantly evolving. With social media developing its own vernacular and online interactions becoming increasingly global, NLP models need more than monolingual corpora to make sense of this data. Roughly 50% of the world's population speaks two or more languages. This poses a challenge for NLP systems, because conventional models are trained to understand a single language or only to translate from one language to another. In this talk, we'll focus on Natural Language Understanding (NLU) for short multilingual texts.
A key step in building NLU systems is language identification. First, we'll introduce existing Python libraries for this task, such as cld3, langid, and langdetect, and briefly discuss their shortcomings.
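As a rough illustration (mine, not from the talk itself), here is a minimal sketch comparing the three libraries on the same inputs; it assumes the PyPI packages langdetect, langid, and pycld3 are installed, and return types may vary across versions:

```python
# Compare three language-identification libraries on the same inputs.
# Assumes: pip install langdetect langid pycld3
import langid
import cld3
from langdetect import detect_langs

samples = [
    "The weather is lovely today.",
    "El clima está muy agradable hoy.",
    "आज मौसम बहुत अच्छा है।",
]

for text in samples:
    ld = detect_langs(text)        # langdetect: candidate languages with probabilities
    li = langid.classify(text)     # langid: (language code, score) tuple
    c3 = cld3.get_language(text)   # cld3: language, probability, is_reliable, ...
    print(f"{text!r}")
    print(f"  langdetect: {ld}")
    print(f"  langid:     {li}")
    print(f"  cld3:       {c3.language} (p={c3.probability:.2f}, reliable={c3.is_reliable})")
```

On short or code-mixed inputs the three libraries often disagree, which is one of the shortcomings worth probing.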
Another area of concern is transliterated and code-switched text, which combines two or more structurally different grammars and vocabularies. This kind of data appears frequently in tweets, Facebook comments, and product reviews. What makes the problem especially challenging is the lack of annotated datasets, along with the added noise of having no "correct" grammar or spelling. We'll discuss approaches that address this using web crawlers and self-generated datasets.
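To make the idea of self-generated datasets concrete, here is a toy sketch (an illustration of the general technique, not the talk's method): it produces Hinglish-style code-switched sentences by randomly substituting words from a small, entirely hypothetical English-to-Hindi transliteration lexicon.

```python
import random

# Hypothetical toy lexicon for illustration only; a real pipeline would
# draw on a proper bilingual dictionary or aligned corpora instead.
EN_TO_HI = {
    "today": "aaj",
    "very": "bahut",
    "good": "accha",
    "weather": "mausam",
}

def synthesize_code_switched(sentence: str, switch_prob: float = 0.5) -> str:
    """Randomly replace known English words with Hindi transliterations
    to simulate code-switched text."""
    tokens = sentence.lower().split()
    mixed = [
        EN_TO_HI[tok] if tok in EN_TO_HI and random.random() < switch_prob else tok
        for tok in tokens
    ]
    return " ".join(mixed)

random.seed(0)
print(synthesize_code_switched("The weather is very good today"))
# e.g. "the weather is very accha aaj"
```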
The next section of the talk will cover the multilingual BERT model released by Google, which was pretrained on text from 104 languages. We'll look at examples of how this model behaves when given short texts in different languages. In the final section, we'll discuss how to evaluate the model on different downstream tasks.
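For reference, here is one common way to load multilingual BERT and embed sentences in several languages, using the Hugging Face transformers library (an assumption on my part; the talk does not specify tooling):

```python
# Embed sentences in different languages with multilingual BERT.
# Assumes: pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT, pretrained on 104 languages
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentences = [
    "I love this product",            # English
    "Me encanta este producto",       # Spanish
    "mujhe yeh product pasand hai",   # code-switched Hinglish
]

inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token's hidden state as a crude sentence embedding.
cls_embeddings = outputs.last_hidden_state[:, 0, :]

# Cosine similarity between the English sentence and the other two;
# higher values suggest the model maps related sentences close together.
sims = torch.nn.functional.cosine_similarity(cls_embeddings[0:1], cls_embeddings[1:])
print(sims)
```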