November Review: AI4D-African Language Dataset Challenge

 

 

On the 1st of November, we launched the AI4D-African Language Dataset Challenge on Zindi, an effort to incentivize the uncovering and creation of African language datasets for improved representation in NLP. This first phase of what is expected to be a two-phase challenge is taking place over 5 months, from November 2019 to March 2020, with submissions evaluated on a monthly basis. Each month, the top 2 submissions will receive a cash prize of USD 500.

Now that we are well into December, we are excited to announce that the top two submissions for November came from:

  • Oshingbesan Adebayo, who submitted a dataset composed of three West African indigenous languages (Hausa, Igbo and Yoruba). The dataset was acquired from a wide variety of sources, ranging from transcriptions of songs, online news sites, excerpts from published books and websites in indigenous languages to blogs, Twitter, Facebook and more.
  • Thierno Diop, who submitted an Automatic Speech Recognition dataset for Wolof in the domain of transportation services. The data was prepared through a collaboration between BAAMTU Datamation, a Senegalese company focused on using data to help companies leverage AI and Big Data, and WeeGo, an app which helps passengers get information about urban transport in Senegal.

Overall, we received 9 submissions in the month of November, composed of data from a total of 4 unique languages. These are Hausa, Igbo, Wolof and Yoruba.

The majority of the data came from online sources. Scraping news sites such as BBC, DW and VOA, which curate news in several African languages, emerged as one of the top ways that participants went about creating datasets. A great strategy for putting together a sizeable dataset over the coming months would be to keep going back to the site(s) every so often and keep your dataset up to date as news is regularly published. Capturing a wide variety of news categories would go a long way in ensuring the dataset is well balanced and representative of language variety. Wikipedia sites published in various languages also featured as a data source.

  • BBC publishes news in Afaan Oromoo, Amharic, Hausa, Igbo, Kirundi, Pidgin, Somali, Swahili, Tigrinya and Yoruba 
  • DW publishes news in Amharic, Hausa and Kiswahili 
  • VOA publishes news in Afaan Oromoo, Amharic, Bambara, Hausa, Kinyarwanda/Kirundi, Ndebele, Shona, Somali, Kiswahili and Tigrinya
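
The revisit-and-update strategy described above could be sketched roughly as follows. This is a minimal illustration rather than anything the challenge prescribes. It assumes the articles have already been scraped into dictionaries with hypothetical `text`, `language` and `category` fields, and the function names `fingerprint`, `merge_articles` and `category_counts` are invented for this example. Duplicates are detected by hashing the normalized text, so re-scraping a site adds only items that were not collected before, and a per-category count helps to spot an unbalanced dataset.

```python
import hashlib
from collections import Counter


def fingerprint(text):
    """Return a stable hash of the whitespace-normalized, lowercased text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def merge_articles(dataset, new_articles):
    """Append articles whose text is not already in the dataset.

    Both arguments are lists of dicts with the hypothetical keys
    'text', 'language' and 'category'. Returns the number of items added.
    """
    seen = {fingerprint(item["text"]) for item in dataset}
    added = 0
    for article in new_articles:
        key = fingerprint(article["text"])
        if key not in seen:
            seen.add(key)
            dataset.append(article)
            added += 1
    return added


def category_counts(dataset, language):
    """Count articles per category for one language."""
    return Counter(a["category"] for a in dataset if a["language"] == language)


# Placeholder records standing in for two scraping runs of the same site.
dataset = []
first_run = [
    {"text": "placeholder article one", "language": "hausa", "category": "sports"},
    {"text": "placeholder article two", "language": "hausa", "category": "politics"},
]
second_run = [
    # Same article as before, differing only in whitespace: skipped as a duplicate.
    {"text": "placeholder  article one", "language": "hausa", "category": "sports"},
    {"text": "placeholder article three", "language": "hausa", "category": "sports"},
]

added_first = merge_articles(dataset, first_run)
added_second = merge_articles(dataset, second_run)
counts = category_counts(dataset, "hausa")
```

Exact-text hashing only catches articles that are identical after normalization. An article that a site edits after publication would be stored twice, so a real pipeline might instead key on the article URL or use near-duplicate detection.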

A closely related online source is Twitter data, which we have seen particularly curated for the task of sentiment analysis. A good place to start would be the accompanying Twitter profiles of the above news sites. While we haven’t had any data sourced from Facebook yet, I imagine that the profiles maintained by these news outlets for various languages would also be a good place to start.  

Manual effort also emerged as a source of data. Some submissions were compiled by one or several individuals coming together to translate pieces of text, and custom tools such as mobile applications were used to crowdsource voice-overs for the dataset created for Automatic Speech Recognition.

I am also excited to announce that we will have a workshop at ICLR 2020, “AfricaNLP – Unlocking Local Languages”, which will be held in Addis Ababa in April of next year. Part of the agenda of this workshop is set aside to showcase exceptional work and resulting datasets that will emerge as output from this exercise.

We will also use the workshop as an opportunity to launch the second phase of this challenge. If you have been following our thought process since the beginning, then you will know that the second phase of the challenge is largely dependent on the outcomes of this first phase. The one (or hopefully two) downstream NLP tasks that will be the object of the second phase will utilise datasets that result from this first phase.

Finally, we have a Call for Papers for the workshop, specifically for research work involving African languages. Feel free to start making your submissions on this page. Here are some key dates to keep in mind:

  • Submission deadline: 1st February, 2020
  • Notification to authors: 26th February, 2020
  • Workshop: 26th April, 2020

Happy Holidays!

Contribution by:
Kathleen Siminyu, AI4D-Africa Network Coordinator
Sackey Freshia, Jomo Kenyatta University of Agriculture and Technology
Daouda Tandiang Djiba, GalsenAI


Announcing the #AI4D Africa Innovation 2019 Winners

The AI for Development (AI4D) Initiative is pleased to announce the winners of the AI4D-Africa Innovation Call for Proposals 2019.

Sign up and join us to celebrate the winners at Deep Learning Indaba 2019 at the #AI4D Network of Excellence Innovation Grant Award Ceremony:

    • Tuesday, 27th August 2019 at 7 PM (Nairobi Time)
    • (LOCATION UPDATE) Interaction Hall – KUCC, Kenyatta University, Nairobi, Kenya.

The first named individual is the Principal Investigator. Funding for these innovation seed grants is made available with the support of Canada’s International Development Research Centre. To learn more about our Network of Excellence in Artificial Intelligence for Development in Sub-Saharan Africa, click here.

Congratulations to all recipients. Follow us at @AI4Dev. 

 

Dr. Abdelhak Mahmoudi  
Mohammed V University of Rabat, Morocco
Arabic Speech-to-MSL Translator: ‘Learning for Deaf’
To develop an Arabic text to Moroccan Sign Language (MSL) translation product by building two corpora of Arabic texts for use in translation into MSL. The collected corpora will be used to train deep learning models to analyze and map Arabic words and sentences to MSL encodings.

 

Dr. Adewale Akinfaderin, Olamilekan Wahab and Olubayo Adekanmbi
Data Duality Lab, Data Science Nigeria, MTN Nigeria, Nigeria
Using Artificial Intelligence to Digitize Parliamentary Bills in Sub-Saharan Africa 
To improve the categorization of parliamentary bills in Nigeria using Optical Character Recognition (OCR), document embedding, and recurrent neural networks, and to expand it to three other countries in Africa: Kenya, Ghana, and South Africa.

 

Dr. Amelia Taylor, Eva Mfutso-Bengo and Binart Kachule
University of Malawi and the Polytechnic, University of Malawi, Malawi
A Semi-Automatic Tool for Meta-Data Extraction from Malawi Court Judgments 
To develop a methodology for a semi-automatic classification of judgments disseminated by the High Court Library of the Malawi Judiciary with the purpose of enabling ‘intelligent searching’ within this body of knowledge.

 

Dr. Aminata Zerbo Sabane, Dr. Tegawendé Bissyande, and T. Idriss Tinto 
Université Joseph Ki-Zerbo and La Communauté Afrique Francophone des Données Ouvertes, Burkina Faso
Preservation of Indigenous Languages
To initiate a research roadmap for the preservation of indigenous languages by collecting, categorizing and archiving translation and voice synthesis resources to support automatic translation in official and indigenous languages.

 

Denis Pastory Rubanga, Dr. Zekaya Never, Dr. Machuve Dina, Lilian Mkonyi, Loyani K. Loyani, Richard Mgaya.
Tokyo University of Agriculture, The Nelson Mandela African Institution of Science and Technology, and Sokoine University of Agriculture, Tanzania
A Computer Vision Tomato Pest Assessment and Prediction Tool    
To monitor pests using a data-driven computer vision technique that directs extension officers’ support services across sub-Saharan Africa, through a real-time pest damage assessment and recommendation support system for small-scale tomato farmers.

 

Martha Shaka, Nyamos Waigama, Emilian Ngatunga, Halidi Maneno, Said Said, Said Mmaka, Frederick Apina, Simon Chaula, Emani Sulutya, Merikiadi Mashaka
University of Dodoma and Benjamin Mkapa Hospital, Tanzania
Effective Creation of Ground Truth Data-Set for Malaria Diagnosis Using Deep Learning 
To create an automatic data annotation tool and ground truth dataset for malaria diagnosis using deep learning. The ground truth dataset and the tool will streamline the development of AI tools for pathology diagnosis.

 

Dr. Moes Thiga and Dr. Pamela Kimeto
Kabarak University, Kenya
Early Detection of Pre-Eclampsia Using Wearable Devices and Long Short Term Memory Networks
To determine the effectiveness of Long Short Term Memory Networks in identifying pregnant mothers at high risk of developing pre-eclampsia, and the effectiveness of pre-eclampsia prophylaxis.

 

Ronald Ojino and Khushal Brahmbhatt
Cooperative University of Kenya, Kenya
A Public Dataset on Poaching Trends in Kenya and a Study on the Predictive Modeling of Poaching Attacks
To test the feasibility of deploying Unmanned Ground Vehicles (UGVs) for automated intelligent patrol, detection, wildlife monitoring and identification across the national parks and reserves in Kenya.

 

Steven Edward, Edward James, and Deo Shao
Nelson Mandela African Institution of Science and Technology, Tanzania
Improving the Pharmacovigilance System Using Natural Language Processing on Electronic Medical Records
To improve the pharmacovigilance system by proposing a novel algorithm for the automatic extraction of adverse drug reaction cases from Electronic Medical Records, reducing the time taken and introducing confidentiality in reporting.

 

Dr. Tegawendé F. Bissyande, Dr. Aminata Zerbo Sabane, and T. Idriss Tinto 
Université Joseph Ki-Zerbo and La Communauté Afrique Francophone des Données Ouvertes, Burkina Faso
Building a Medicinal Plant Database for Preserving Ethnopharmacological Knowledge in the Sahel 
To initiate the collection and construction of a medicinal plant database, on top of which a search engine and AI-based image recognition for plants will be built to enable scalable search of preserved knowledge.

Naser Faruqui from IDRC discussing AI and its impact for development

Video by Naser Faruqui, IDRC, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

In his work, Naser supports scientists in developing countries and their innovations, helping them solve their own problems. The confluence of big data, powerful computing and machine learning has created a tipping point for powerful applications, and this can be transferred to development opportunities in the areas of health, education and agriculture.

My bluesky is to ensure that Africans can fully contribute to, participate in and benefit from potential opportunities in AI.

 

 

Tejumade Afonja from AI Saturdays Lagos on Artificial Intelligence and the importance of local communities

Video by Tejumade Afonja, AI Saturdays Lagos, InstaDeep, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

Tejumade is, among other things, a community builder and very passionate about AI education. As part of AI Saturdays Lagos, she runs a programme to democratize AI that teaches a machine learning curriculum over 16 consecutive Saturdays.

My bluesky project for AI in Africa is to support other AI communities in other parts of Africa and the world. The other is to build an African food image dataset.

 

Meet Kathleen Siminyu: One of Africa’s AI community women leaders

Kathleen Siminyu is the co-organizer of WiMLDS in Nairobi with her friend Muthoni Wanyoike, who leads a team at InstaDeep, an AI startup. Kathleen is the Head of Data Science at Africa’s Talking and part of the Steering Committee of Deep Learning Indaba.

A little over 2 years ago, Muthoni Wanyoike and I started the Nairobi chapter of Women in Machine Learning and Data Science (WiMLDS).

We were searching for our tribe: a community of people who, much like us, were either working in the data space or were curious about it and looking to explore. In the span of two years, this community has grown and evolved tremendously. We have experimented with a variety of activities, including monthly meetups, quarterly study groups, hackathons, and round table discussions, and now have a following of over 2,000 people on meetup.com. Some of my personal favourite stories from the community are of the women and men who “earn their stripes” and then give back by taking it upon themselves to organize an AI community activity. Succession and continuity are important for long-term impact.

When we started doing this, there was no other such space in Nairobi. Today, there are several other communities that possibly do it much better than we ever did or could. The Nairobi AI ecosystem has grown, and we contributed to it. However, our story is not unique.

Across the continent, we are seeing the creation of a vibrant African AI ecosystem. Data Science Nigeria runs programs strategically focused on capacity building, particularly in secondary schools and universities. Blossoms Academy in Ghana provides university graduates with the skills needed to launch meaningful careers in Data Science. The first edition of the North Africa Machine Learning Summer School will take place in Morocco in June of this year, and the third Deep Learning Indaba will take place in August of this year in Nairobi, Kenya. In addition, smaller, independently organized IndabaX events are being hosted in 27 countries across the continent: Algeria, Botswana, Burkina Faso, Burundi, Cameroon, Democratic Republic of the Congo, Egypt, Ethiopia, Ghana, Kenya, Lesotho, Malawi, Morocco, Namibia, Nigeria, Rwanda, Senegal, Somalia, South Africa, Sudan, Swaziland, Tanzania, The Gambia, Tunisia, Uganda, Zambia and Zimbabwe.

My intention in listing all these is to highlight the fact that there are increasingly more opportunities for Africans on the continent to build technical skills in the fields of Data Science and Machine Learning through communities, summer schools and now the African Masters of Machine Intelligence at the African Institute for Mathematical Sciences (AIMS), whereas several years ago there was a dearth of such opportunities.

Map of countries hosting the IndabaX 2019 events | Deep Learning Indaba

Clearly, something is a-brewing. Back in January, a friend of mine in the Nairobi WiMLDS community messaged me to wish me a happy new year and we had the following exchange:

M: What’s the plan for Data Science in Kenya in 2019? * I laughed out loud, really I did *

K: The national plan…ama?

M: Si ati national plan. If you are working on something cool I can join in. Side project or stuff…

He was asking me a question I had been asking myself for a while, “What next?”

What next for the individuals who begin engaging with the Nairobi WiMLDS community with little more than an interest? The ones who begin by picking up the basics of programming in R or Python, then collaborate on fun projects, take part in a hackathon, and give back by facilitating sessions for others? What more can they do to grow and further contribute as part of the community? Short of getting them a job, which we cannot possibly do for every individual, what more can we do to further engage and foster this community?

“The future of AI and Machine Learning (ML) will be driven by the community through grassroot movement. This is a bottom-up model where people come together to build solutions for problems they can associate with.”

Six reasons why community-driven AI is the future by Rudrab Mitra

In the context of our local WiMLDS community, the next step is open source projects. We are crowdsourcing efforts to build natural language processing (NLP) resources for low-resource languages among Kenya’s 67 living languages, such as Dholuo and Kamba, with over 4 million speakers. This is an ideal problem set to begin with, in my opinion, because anyone with the know-how can begin contributing towards resources for the languages that they care about. Given the great diversity of African languages and the overall lack of tools to easily work with them, this is a problem that we can all relate to when it comes to building machine learning applications based on the processing of our own languages.

Other communities across the continent will soon find themselves asking the same question, or some variation of it, if they are not already. For those focused on capacity building, the next question is where and how to make the best use of this capacity.

How can we begin to fix our education systems such that we no longer have to innovate and create spaces outside of it to make up for the gaps?

How can we support individuals who are innovating: get them mentorship and resources, equip them well to increase their chances of succeeding as entrepreneurs, and help them reimagine Kenya’s industries and contribute to Africa’s economic and social success? How can we help them gain access to funding and credit? Do we have the necessary infrastructure and enabling systems upon which they can build and innovate?

Going forward, the road is long and no one individual holds the answers, but we each must play our part if we hope to reap the benefits of the collective efforts of the community.


Contribution by Kathleen Siminyu for the AI Research Network of Excellence @AI4Dev #AI4DNetwork. She is the head of Data Science at Africa’s Talking, co-organizer of Nairobi WiMLDS, and part of the steering committee of Deep Learning Indaba.