Talks AI

Making African Languages Visible: A Python-Based Guide to Low-Resource Language ID

Friday, May 15th, 2026 12:30 p.m.–1:15 p.m. in Grand Ballroom A

Presented by

Experience Level:

Some experience

Description

African languages remain heavily underrepresented in NLP, and building reliable language identification tools for them is still a major challenge. In this session, we explore how Python and FastText can be used to build practical language identification systems for low-resource African languages, drawing insights from the MasakhaNER dataset on Hugging Face, one of the most comprehensive open-source African language corpora.

The talk begins with an overview of the unique characteristics of African languages that affect NLP performance, including dialect diversity, orthographic variation, code-switching, and limited labelled resources. We then outline a clear workflow for preparing multilingual datasets, selecting features, and evaluating language identification models, with a focus on the realistic constraints faced in low-resource environments.
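As a rough illustration of the dataset-preparation step, the sketch below turns labelled sentences into the `__label__<lang> <text>` line format that fastText's supervised mode reads, and holds out a per-language test split. The helper names, the language codes, and the toy greetings are assumptions made for illustration; they are not taken from MasakhaNER or from the talk itself.

```python
import random
import re
from collections import defaultdict


def normalise(text):
    """Lowercase and collapse whitespace; diacritics are kept on purpose."""
    return re.sub(r"\s+", " ", text.strip().lower())


def to_fasttext_line(lang, text):
    """Format one example as a fastText supervised-mode training line."""
    return f"__label__{lang} {normalise(text)}"


def stratified_split(examples, test_fraction=0.2, seed=13):
    """Split (lang, text) pairs per language.

    Every language with at least two examples ends up in both sets;
    a language with a single example stays in the training set.
    """
    by_lang = defaultdict(list)
    for lang, text in examples:
        by_lang[lang].append(text)
    rng = random.Random(seed)
    train, test = [], []
    for lang, texts in sorted(by_lang.items()):
        rng.shuffle(texts)
        n_test = min(len(texts) - 1, max(1, int(len(texts) * test_fraction)))
        test += [(lang, t) for t in texts[:n_test]]
        train += [(lang, t) for t in texts[n_test:]]
    return train, test


# Toy examples only (hypothetical data, not drawn from any corpus).
examples = [
    ("swa", "Habari za asubuhi"),
    ("swa", "Asante sana"),
    ("swa", "Karibu nyumbani"),
    ("hau", "Ina kwana"),
    ("hau", "Na gode sosai"),
    ("hau", "Sannu da zuwa"),
]

train, test = stratified_split(examples)
train_lines = [to_fasttext_line(lang, text) for lang, text in train]
test_lines = [to_fasttext_line(lang, text) for lang, text in test]
```

Written one line per example to a text file, `train_lines` could then be passed to the `fasttext` package's `train_supervised(input=...)`. Its `minn` and `maxn` options control the character n-gram range, which is likely to matter for orthographic variation, though the best values would need to be found empirically.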

A central part of the talk compares FastText with tools used for African NLP, such as AfroXLMR, the Masakhane models, and spaCy’s limited-language pipelines. This comparison highlights key differences in language coverage, model size, task flexibility, and production readiness. Attendees will gain practical guidance on when FastText is sufficient, when transformer-based models offer clear advantages, and how to navigate trade-offs around accuracy, speed, and resource usage.
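One way to make the accuracy and speed trade-off concrete is to run every candidate model through the same small harness. The sketch below is a minimal version of that idea, assuming each model is wrapped in a hypothetical `predict(text) -> language_code` callable; the wrappers themselves are not shown and would differ for each tool. It reports macro-averaged F1, which weights every language equally regardless of how many test sentences it has, alongside throughput.

```python
import time
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-language F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted language p, but it was wrong
            fn[g] += 1  # gold language g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def benchmark(predict, texts, gold):
    """Time a predictor over texts and score it against gold labels."""
    start = time.perf_counter()
    predicted = [predict(text) for text in texts]
    elapsed = time.perf_counter() - start
    return {
        "macro_f1": macro_f1(gold, predicted),
        "sentences_per_second": len(texts) / elapsed if elapsed > 0 else float("inf"),
    }


# Hypothetical stand-in predictor, used only to show the call shape.
def always_swahili(text):
    return "swa"


report = benchmark(always_swahili, ["Asante sana", "Na gode sosai"], ["swa", "hau"])
```

Because the harness only sees a callable, the same test set can be reused across a FastText model and a transformer-based one. Model size and memory use would have to be measured separately.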

This session emphasizes conceptual clarity, reproducible steps, and real-world lessons from applying these tools to African language datasets. The audience will leave with a strong understanding of the challenges and opportunities in low-resource language identification, along with actionable strategies for designing more inclusive NLP systems.