Presentation: Statistical Machine Translation with NLTK

Liling Tan

Audience level: Intermediate
Category: Science

Description

The NLTK toolkit is the de facto standard for text analytics and natural language processing among Python developers. NLTK's recently extended `translate` module brings machine translation capabilities to Python programmers. This poster introduces the basic components of Statistical Machine Translation and demonstrates that machine translation is indeed achievable by mere mortals.

Abstract

Machine translation is considered the holy grail of Natural Language Processing (NLP): the hope that some day we will be able to automatically translate a foreign language into our native language, and the wish for a babelfish device that can interpret live. That day has come: we can now perform machine translation in Python, within our favorite NLP toolkit, the Natural Language Toolkit (`NLTK`).

We introduce some basic knowledge of Statistical Machine Translation (SMT) with (i) the _phrase-based machine translation_ paradigm, (ii) the _noisy channel alignment model_, (iii) _ngram language models_ and (iv) the _log-linear decoding model_. Thereafter, we guide users step by step through the SMT processes with code snippets from the `align` and `translate` modules in the NLTK toolkit. First, we present the word-alignment models available in the NLTK `align` module, and then we show how to output the phrase table with the phrases' probabilities. Finally, we introduce the stack-based decoder that produces the final translated output. Before concluding, we compare the current performance (in speed and accuracy) of the `NLTK` stack decoder with the reigning machine translation toolkit, `Moses`.
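To give a feel for the word-alignment step mentioned above, here is a minimal sketch of expectation-maximization training for a lexical translation table in the style of IBM Model 1. It is written in plain Python and is not NLTK's implementation. The function names `train_ibm1` and `best_translation` and the three-sentence toy corpus are assumptions made for illustration, and the NULL source token used in the full model is left out for brevity.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word).

    bitext is a list of (source_tokens, target_tokens) pairs.
    Simplification (assumption): no NULL source token is modelled.
    """
    target_vocab = {w for _, tgt in bitext for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word

        # E-step: distribute each target word over the source words
        # of its sentence, in proportion to the current probabilities.
        for src, tgt in bitext:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac

        # M-step: renormalize the expected counts per source word.
        t = defaultdict(
            float, {(w, s): c / total[s] for (w, s), c in count.items()}
        )
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {w: p for (w, s), p in t.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy corpus: German source, English target.
bitext = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

t = train_ibm1(bitext)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(t, word))
```

Even though no sentence pair says which word aligns with which, the repeated co-occurrence of "das" with "the" and "Buch" with "book" lets the remaining words ("Haus", "ein") settle on their translations after a few iterations. A table of this kind is the starting point for extracting phrase pairs and their probabilities.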