AI4D – African Language Dataset Challenge

NLP Challenge

Getting started with programming is easy, a well-trodden path. Whether it be picking up the skill itself, a new programming language or venturing into a new domain, like Natural Language Processing (NLP), you can be sure that a variety of beginner tutorials exist to get you started. The ‘Hello World!’s, as you may know them. 

Where NLP is concerned, some paths tend to be better trodden than others. It is infinitely easier to accomplish an NLP task, say Sentiment Analysis, in English than it is to do the same in my mother tongue, Luhya. This reality follows from the fact that the languages of the digital economy are major European languages.

The gap between languages with plenty of data available on the Internet and those without is ever increasing. In recent times, pre-trained language models have led to significant improvements in various NLP tasks, and Transfer Learning is rapidly changing the field. While leading architectures for pre-training models for Transfer Learning in NLP are freely available for use, most are data-hungry. The GPT-2 model, for instance, was trained on roughly 40 GB of text drawn from about 8 million web pages. (ref)

The only way I know how to begin closing this gap is by creating, uncovering and collating datasets for low-resource languages. With the AI4D – African Language Dataset Challenge, we want to spur on some groundwork. Deep Learning techniques now make it possible to dream of a future where NLP researchers and practitioners on the continent can easily innovate in the languages their communities speak, a future where literacy and mastery of a major European language is no longer a prerequisite to participation in the digital economy. But these techniques require data. Data that can only be created by the communities that speak these languages, by individuals who have the technical skills, by those of us who understand the importance of this work and have the desire to undertake it.

The challenge will run for 5 months (November 2019 to March 2020), with cash prizes of USD 500 awarded as an incentive to the top 2 submissions each month. This is the first of a two-phase challenge. This first phase focuses on the creation of datasets. We would like to see some of these datasets developed for specific downstream tasks, but this is not necessary.

We have, however, earmarked four downstream NLP tasks and anticipate that one (or two) of these will be the framing of the second phase of this challenge: Sentence Classification, Sentiment Analysis, Question Answering and Machine Translation. Other downstream tasks that participants may be interested in developing datasets for, or have already developed datasets for, are also eligible. Our intention is that the datasets are kept free and open for public use under a Creative Commons license once the challenge is complete.

The challenge is hosted on Zindi; head on over to this page for full details. It is made possible by the prize money provided through a partnership between the International Development Research Centre (IDRC) and the Swedish International Development Cooperation Agency (SIDA), the facilitation of the challenge through the combined efforts of the Artificial Intelligence for Development Network (AI4D-Africa) and the Knowledge 4 All Foundation (K4All), and finally, our expert panel, who have volunteered their time to undertake the difficult qualitative aspect of dataset assessment: Jade Abbott (RetroRabbit), John Quinn (Google AI/Makerere University), Kathleen Siminyu (AI4D-Africa), Veselin Stoyanov (Facebook AI) and Vukosi Marivate (University of Pretoria).

The rest, we leave up to the community.  

Contribution by Kathleen Siminyu, AI4D-Africa Network Coordinator

Photo by Eva Blue on Unsplash.