News Archives

Building a Data Pipeline for a Real World Machine Learning Application

We set out with a novel idea: to develop an application that would (i) collect an individual’s Blood Pressure (BP) and activity data, and (ii) make future BP predictions for the individual from this data.

The key requirements for this study were therefore:

  1. The ability to get the BP data from an individual.
  2. The ability to get a corresponding record of their activities for the BP readings.
  3. The identification of a suitable Machine Learning (ML) Algorithm for predicting future BP.

Pre-test the idea – Pre-testing the idea was a critical first step before we could proceed to collect the actual data. The data collection process would require the procurement of suitable smart watches and the development of a mobile application, both of which are time-consuming and costly activities. At this point we learnt our first lessons: (i) there was no precedent for what we were attempting, and consequently (ii) there were no publicly available BP datasets for pre-testing our ideas.

Simulate the test data – The implication was that we had to simulate data based on the variables identified for our study. The variables used were the systolic and diastolic BP readings, activity and a timestamp. This was done in a spreadsheet and the data saved as a comma-separated values (CSV) file, a common file format for storing data in ML.
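
As a rough illustration of this step, the sketch below generates a small multivariate series and writes it out as a CSV file; the column names, value ranges and activity labels are illustrative assumptions, not the study's actual schema.

```python
# A minimal simulation sketch (column names, ranges and activity labels are
# illustrative assumptions, not the study's actual schema).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500  # number of simulated readings

timestamps = pd.date_range("2020-01-01", periods=n, freq="H")
activities = rng.choice(["sleeping", "resting", "walking", "running"], size=n)

# Rough physiological baselines, shifted slightly by activity level.
offset = {"sleeping": -5, "resting": 0, "walking": 5, "running": 15}
systolic = 115 + rng.normal(0, 8, n) + np.array([offset[a] for a in activities])
diastolic = 75 + rng.normal(0, 6, n) + 0.5 * np.array([offset[a] for a in activities])

df = pd.DataFrame({
    "timestamp": timestamps,
    "systolic": systolic.round(1),
    "diastolic": diastolic.round(1),
    "activity": activities,
})
df.to_csv("simulated_bp.csv", index=False)
```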

Identify a suitable ML model – Both the simulated data and the data in the final study would be time series data. The need to predict both systolic and diastolic BP from previous readings, activity and timestamps meant that we were handling multivariate time series data. We therefore tested and settled on an LSTM model for multivariate time series forecasting, based on a guide by Dr Jason Brownlee (https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/).
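
The sketch below shows the shape of such a set-up, in the spirit of the guide linked above: past windows of systolic, diastolic and activity readings are used to predict the next systolic/diastolic pair. The file name, window length and layer sizes are assumptions for illustration, not the project's final configuration.

```python
# Sketch of a multivariate LSTM forecaster over the simulated CSV from the
# previous step (file name, window length and layer sizes are assumptions).
import numpy as np
import pandas as pd
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

df = pd.read_csv("simulated_bp.csv")
df["activity_code"] = df["activity"].astype("category").cat.codes
values = df[["systolic", "diastolic", "activity_code"]].to_numpy(dtype="float32")

def make_windows(data, n_steps=6):
    """Turn the series into (past n_steps readings) -> (next systolic, diastolic)."""
    X, y = [], []
    for i in range(len(data) - n_steps):
        X.append(data[i:i + n_steps])
        y.append(data[i + n_steps, :2])  # predict systolic and diastolic only
    return np.array(X), np.array(y)

X, y = make_windows(values)

model = Sequential([
    LSTM(50, activation="tanh", input_shape=(X.shape[1], X.shape[2])),
    Dense(2),  # systolic and diastolic outputs
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)

# One-step-ahead prediction from the most recent window of readings.
next_bp = model.predict(values[-6:][np.newaxis, ...], verbose=0)
print("Predicted systolic/diastolic:", next_bp[0])
```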

Develop the data collection infrastructure – With no pre-existing data available, we had to collect our own. The unique nature of our study, collecting BP and activity data from individuals, called for an innovative approach to the process.

  • BP data collection – for this aspect of the study we established that the best approach would be to use smart watches with BP data collection and transmission capabilities. Another key consideration in device selection was affordability. This was occasioned both by the circumstances of the study, with limited resources available, and more importantly by the context of use of a probable final solution: the watch would have to be affordable to allow for wide adoption of the solution.

The watch identified was the F1 Wristband Heart and Heart Rate Monitor.

  • Activity data collection – for this aspect of the study a mobile application was identified as the method of choice. The application was developed to receive BP readings from the smart watch and to collect activity data from the user.

Test the data collection – The smart watch – mobile app data collection was tested and a number of key observations were made.

  • Smart watch challenges – Although the watch identified is affordable, it does not work well for dark-skinned persons. This is a major challenge given that the majority of people in Kenya, the location of the study and of eventual system use, are dark-skinned. As a result we are examining other options that may work universally.
  • Mobile app connectivity challenges – The app initially would not connect to the smart watch, but this was resolved and data collection is now possible.

Next Steps

  • Pilot the data collection – We are now working on piloting the solution with at least 10 people over a period of 2–3 weeks. This will give us an idea of how the final study will be carried out with respect to:
  1. How the respondents use the solution;
  2. The kind of data we will actually be able to get from the respondents;
  3. The suitability of the data for the machine learning exercise.
  • Develop and Deploy the LSTM Model – We shall then develop the LSTM model and deploy it on the mobile device to examine the practicality of our proposed approach to BP prediction.

Extracting meta-data from Malawi Court Judgments

We set ourselves the task of developing semi-automatic methods for extracting key information from criminal cases issued by courts in Malawi. Our body of court judgments came partly from the MalawiLii platform and partly from the High Court Library in Blantyre, Malawi. We focused our first analysis on cases from 2010 to 2019.

Amelia Taylor, University of Malawi | UNIMA · Information Technology and Computing

Here is an example of a case for which a PDF is available on MalawiLii. Here is an example of a case for which only a scanned image of a PDF is available. We used OCR on more than 90% of the data to extract the text for our corpus (see the description of our corpus below).
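
The post does not name the OCR tools used, so purely as an illustration of this step, the sketch below assumes pdf2image and pytesseract (which in turn require the Poppler utilities and the Tesseract engine to be installed) to turn a scanned judgment into plain text.

```python
# Illustrative only: the post does not name its OCR tools. This sketch assumes
# pdf2image + pytesseract, which need Poppler and Tesseract installed locally.
from pdf2image import convert_from_path
import pytesseract

def scanned_pdf_to_text(pdf_path: str) -> str:
    """OCR every page of a scanned judgment and return the concatenated text."""
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

text = scanned_pdf_to_text("judgment_scan.pdf")  # hypothetical file name
```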

Please open these files to familiarise yourself with the content of a criminal court judgment. What kind of information did we want to extract? For each case we wanted:

  1. Name of the Case
  2. Number of the Case
  3. Year in which the case was filed
  4. Year in which the judgment was given and the court which issued the judgment
  5. Names of Judges
  6. Names of parties involved (appellants and respondents, but you can take this further and extract names of principal witnesses, and names of victims)
  7. References to other Cases
  8. References to Laws/Statutes and Codes, and
  9. Legal keywords which can help us classify the cases according to the ICCS classification.

This project has taught us so much about working with text, preparing data for a corpus, exchange formats for the corpus data, analysing the corpus using lexical tools, and machine learning algorithms for annotating and extracting information from legal text.

Along the way we also experimented with batch OCR processing and with different annotation formats, such as IOB tagging[1] and the XML TEI[2] standard, for sharing and storing the corpus data, but also with a view to using these annotations in sequence-labelling algorithms.

Each has advantages and disadvantages: IOB tagging does not allow nesting (or multiple labels for the same element), while an XML notation allows this but is more challenging to use in algorithms. We also learned how to build a corpus, and experimented with existing lexical tools for analysing it and comparing it to other legal corpora.
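
To make the contrast concrete, here is a small illustration of the two formats on a single statute reference; the label names (LAW_REF, statute) are illustrative assumptions, not the project's actual tag set.

```python
# Flat IOB tags: one tag per token, so the statute name cannot be nested
# inside the wider law reference (label names are illustrative only).
tokens = ["Section", "346", "(3)", "of", "the", "Criminal",
          "Procedure", "and", "Evidence", "Code"]
iob_tags = ["B-LAW_REF"] + ["I-LAW_REF"] * 9

for token, tag in zip(tokens, iob_tags):
    print(f"{token}\t{tag}")

# An XML-style annotation, by contrast, can nest the statute inside the reference:
xml_version = ("<lawRef>Section 346 (3) of the "
               "<statute>Criminal Procedure and Evidence Code</statute></lawRef>")
```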

We learned how to use POS annotations and contextual regular expressions to extract some of our annotations for laws and case citations, and we generated more than 3,000 different annotations. Another interesting lesson is that preparing annotated training data is not easy: for example, most algorithms require training examples to be of the same size, and the training set needs to be a good representation of the data.

We also experimented with classification algorithms and topic detection using scikit-learn, spaCy, Weka and MATLAB. The hardest task was to prepare the data in the right format and to anticipate how this data would lead to the outputs we saw. We feel that time spent on organising and annotating well is not lost but will pay off in the second stage of the project, when we focus on algorithms.

Most algorithms split the text into tokens, and for us, multi-word tokens (or sequences) are those we want to find and annotate. This means a focus on sequence-labelling algorithms. An added complication peculiar to legal text is that most of our key terms logically belong to more than one label, and the context of a term can span multiple chunks (e.g., sentences).

When using LDA (Latent Dirichlet Allocation) to detect topics in our judgments, it became clear that one needs to use a somewhat ‘summarised’ version in which sequences of words are collapsed into their annotations (this is because LDA uses a term-frequency-based measure of keyword relevance, whereas in our text the most relevant words may appear much less frequently than others).
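
A minimal sketch of this idea, assuming toy documents in which multi-word legal sequences have already been collapsed into single annotation tokens, might look as follows.

```python
# Toy sketch: LDA over 'summarised' judgments where multi-word legal phrases
# have been collapsed into single annotation tokens (documents are invented).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "appellant convicted murder MALICE_AFORETHOUGHT PENAL_CODE_S209 sentence",
    "accused theft chattels PENAL_CODE_S278 restitution sentence",
    "appellant murder MALICE_AFORETHOUGHT confirmation sentence death",
]

vectorizer = CountVectorizer(lowercase=False)  # keep collapsed annotations intact
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-3:]]
    print(f"Topic {i}: {top_terms}")
```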

Our work has highlighted to us the benefits and importance of multi-disciplinary cooperation. Legal text has its peculiarities and complexities so having an expert lawyer in the team really helped!

Finding references to laws and cases is made slightly more complicated by the variety of forms in which these references appear and by the use of “hereinafter”[3], e.g., Mwase Banda (“hereinafter” referred to as the deceased). But this can also happen for references to laws or cases, as the following example shows:

Section 346 (3) of the Criminal Procedure and Evidence Code Cap 8:01 (hereinafter called “the Code”) which Wesbon J  was faced with in the case of  DPP V Shire Trading CO. Ltd (supra) is different from the wording of Section 346 (3) of the Code  as it stands now.

Compare extracting the reference to law from “Section 151(1) of the Criminal Procedure and Evidence Code” with extracting it from “Our own Criminal Procedure and Evidence Code lends support to this practice in Sections 128(d) and (f)”. We have identified a reasonably large number of different references to laws and cases used in our text! The situation is very similar for case citations. Consider the following variants (a rough pattern-matching sketch follows the list):

  • Republic v Shautti , Confirmation case No. 175 of 1975 (unreported)
  • Republic v Phiri [ 1997] 2 MLR 68
  • Republic v Francis Kotamu , High Court PR Confirmation case no. 180 of 2012 ( unreported )
  • Woolmington v DPP [1935] A.C. 462
  • Chiwaya v Republic  4 ALR Mal. 64
  • Republic v Hara 16 (2) MLR 725
  • Republic v Bitoni Allan and Latifi Faiti
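
As a rough sketch, a single contextual regular expression of the kind mentioned earlier can already pick out the party names in several of these variants; extending it to capture the citation tails (law report, year, case number) and spellings such as an upper-case “V” is left as further work. This is an illustration, not the project's actual extraction rules.

```python
# Rough sketch of a contextual regular expression for "X v Y" case citations;
# it captures only the party names from the variants listed above.
import re

CASE_RE = re.compile(
    r"[A-Z][A-Za-z.]*(?:\s+[A-Z][A-Za-z.]*)*"              # first party
    r"\s+v\s+"                                              # the 'v' separator
    r"[A-Z][A-Za-z.]*(?:\s+(?:and\s+)?[A-Z][A-Za-z.]*)*"    # second party ('and' allowed)
)

examples = [
    "Republic v Shautti , Confirmation case No. 175 of 1975 (unreported)",
    "Republic v Phiri [ 1997] 2 MLR 68",
    "Woolmington v DPP [1935] A.C. 462",
    "Republic v Bitoni Allan and Latifi Faiti",
]
for line in examples:
    match = CASE_RE.search(line)
    print(match.group(0) if match else "no match")
```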

Something for you to do practically! To play with some annotations, and to appreciate both the diversity of formats and the huge savings that semi-automatic annotation can bring, we have set up a doccano platform for you: log in here using the user guest and the password Gu3st#20.

Annotating with keywords for the purposes of the ICCS classification proved to be even harder. The International Classification of Crime for Statistical Purposes (ICCS)[4] is a classification of crimes as defined in national legislation; it comes in several levels, each with a varying degree of specification. We considered mainly Level 1 and wanted to classify our judgments according to its 11 types, as shown in the table.

Table 1: Level 1 sections of the ICCS

We discovered that this task of classification according to Level 1 requires a lot of work and is of significant complexity (and the complexity only grows if we consider the sublevels of the ICCS). First, the legal expert on our team manually classified all criminal cases of 2019 according to Level 1 of the ICCS and worked on a correspondence between the Penal Code and the ICCS classification. This is excellent groundwork.

We are in the process of extending this to map other Malawian laws, codes and statutes relevant to criminal cases onto the ICCS. This is a whole project in its own right for the legal profession and requires processing a lot of text and making ‘parallel correspondences’! Such national correspondence tables are still a work in progress in most countries, and to our knowledge ours is the first such work for Malawi.

Looking at Level 1 of the ICCS kept us very busy. Our research centred on hard and important questions. How do we represent our text so that it can be processed efficiently? What kind of data labels are most useful for the ICCS classification? What type of annotation should we use (IOB or XML-based)? What algorithms should we employ (Hidden Markov Models, Recurrent Neural Networks or Long Short-Term Memory networks)? Most importantly, how should we prepare our annotated data for use with these algorithms?

We need to be mindful that this is a fine-grained classification, because we have to distinguish between texts that are quite similar. For example, if we wanted to classify a judgment by the type of law it falls under, say whether it is a civil or a criminal case, this would be slightly easier because the keywords and vocabulary used in civil cases are quite different from those used in criminal cases.

Here, however, we want to distinguish between types of crime, and the language used in our judgments is very similar. Within our data set the level of difficulty varies: for example, theft and murder cases (Types 1 and 7 from the table above) may be easier to differentiate than, say, Types 1 and 2.

We have the added complication that most text representation models define the relevance of a keyword by its frequency (whether that is TF or TF-IDF), but in our text a word may appear only once and still be the most significant word for the purposes of our classification. For example, a keyword that distinguishes between Type 1 and Type 2 murders is “malice aforethought”, and it may occur only once in the text of the judgment.
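
A toy illustration of this point: in a plain frequency-based representation of a single judgment, the decisive phrase occurs once while routine procedural vocabulary dominates. The document below is invented.

```python
# Toy illustration: in a plain term-frequency view of one judgment, the decisive
# phrase "malice aforethought" (1 occurrence) is dwarfed by routine vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

judgment = (
    "The court heard the appeal. The court reviewed the evidence and the court "
    "below had convicted the appellant. The court found the killing was carried "
    "out with malice aforethought and the court confirmed the sentence."
)
vec = CountVectorizer(ngram_range=(1, 2))
counts = vec.fit_transform([judgment]).toarray()[0]
freq = dict(zip(vec.get_feature_names_out(), counts))
print("court:", freq["court"], " malice aforethought:", freq["malice aforethought"])
# -> court: 5  malice aforethought: 1
```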

To help with this, one can first extract the structure of the judgment and focus only on the part that deals with the judge’s sentence. Indeed, there is research that focuses only on extracting the various segments of a judgment.

This may work in many cases, because the sentence is usually summarised in one paragraph, but it does not work for all of them. This is especially so when the case history is long, the crime committed has several facets, or the case has several counts, e.g., when the murder victim is an albino or a disabled person.

In such situations one needs a combined strategy which uses: (1) a good set of text annotated with the meta-data described above; (2) a mapping of the Penal Code, laws and statutes relevant to the ICCS; (3) collocations of words or a thesaurus, and (4) concordances, to help us detect clusters and extract relevant portions of the judgments; and (5) sequence-modelling algorithms, e.g., HMMs and recurrent neural networks, for annotation and classification.

In the first part of the project, we focused on tasks (1)–(4) and experimented to some extent with (5). What we wanted was to find a representation of our text based on all the information in (1)–(4) and to attempt to use that in the algorithms we employ.

We have created a training set of over 2500 annotations for references to sections of the law and over 1000 annotations for references to other cases. We are still preparing these so that they are representative of the corpus and are good examples.

Finally, and most importantly, working on this AI4D project has brought me into contact with very clever people whom I would not otherwise have met. We appreciate the support and guidance of the AI4D team!

[1] https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

[2] http://fedora.clarin-d.uni-saarland.de/teaching/Corpus_Linguistics/Tutorial_XML.html

[3] Hereinafter is a term that is used to refer to the subject already mentioned in the remaining part of a legal document. Hereinafter can also mean from this point on in the document.

[4] United Nations Economic Commission for Europe. Conference of European Statisticians. Report of the UNODC/UNECE Task Force on Crime Classification to the Conference of European Statisticians. 2011. Available: www.unodc.org/documents/data-andanalysis/statistics/crime/Report_crime_classification_2012.pdf

Maria Fasli, University of Essex, UNESCO Chair in Data Science and Analytics on developing AI solutions in Africa

Play video by Maria Fasli, University of Essex, UNESCO Chair in Data Science and Analytics, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on at the moment?

My name is Maria Fasli, I am a professor of computer science and my area of expertise is in Artificial Intelligence. I work for the University of Essex in the UK. I work on a range of projects with both industry and public sector organizations, trying to help them understand the data that they have, their needs around data and how to make better use of their data.

How do you perceive development and Artificial Intelligence?

This is a really interesting question; I think AI has a really big role to play in development. We need to bring AI to developing and transitioning countries to make a difference on the ground. It is not about us making up solutions in the West, but about developing solutions here locally.

There is a whole area that we need to work on around developing capacity and helping people create the right networks here in Africa as well as in other areas in the world, South Africa, Southeast Asia, to make a difference.

There is big scope to use AI to support the sustainable development goals and make progress, helping developing and transitioning countries develop into knowledge economies so that they are the ones with the power to make a difference for their own citizens.

What is your blue sky project in Africa?

This is another really good question. In the West, we’ve been using surveys to collect data and we’ve been doing clinical trials; we’re always trying to learn in a very structured kind of way. What I would like to work on, if I had an unlimited budget, is techniques that can learn and reason from observational data.

Here, instead of running a survey and collecting data about the population, where you can control what it is that you’re getting back, you would be learning from the kind of data that is already available. There is an abundance of data, but we’re currently lacking the techniques to make sense of this data.

How do you feel about the workshop?

I think it has been amazing, we’ve made a lot of progress, we’ve had concrete ideas coming out as the next steps and I look forward to personally supporting the initiative going forward if I’m needed in whichever way is possible.

Do you have a one-liner for us? One line?

A slogan. AI for all!

December Review: AI4D African Language Dataset Challenge

The close of 2019 marked the second month of the AI4D African Language Dataset Challenge, an effort aimed at incentivizing the uncovering and creation of African language datasets for improved representation in NLP. This challenge is hosted on Zindi and has been ongoing since the 1st of November. Each month we take stock and award a total of USD 1,000 to the two most outstanding submissions.

In December, these two were as follows:

  • A Yoruba dataset submitted by David Adelani. This submission was put together by three individuals, David, Damilola Adebonojo and Omo Yooba, the latter two of whom are major Yoruba contributors for Global Voices Lingua, a movement which aims to bridge worlds and amplify voices through translating stories into dozens of languages. Beyond including some of the news stories from the Global Voices website, they translated several chapters of a book, got parallel sentences from a Twitter account that posts Yoruba proverbs, translated part of a movie dialogue found on YouTube and supplemented these with multi-domain sentences containing scientific and medical terms to work towards a representative dataset.
  • A Fongbe submission composed of datasets prepared for two tasks:
    • Fongbe–French Machine Translation, with data sourced from Bible translations, scraping a website and translating a book freely available online.
    • Automatic Speech Transcription data consisting of phoneme labels, single-speaker audio sentences as well as multi-speaker conversational audio.

We received 6 submissions in December, composed of data from 4 languages: Fongbe, Igbo, Swahili and Yoruba. This brings our overall language total, taking into consideration the November and December submissions, to 6: Fongbe, Hausa, Igbo, Swahili, Wolof and Yoruba.

We observed one novel data collection process that involved first scanning text from a book containing a collection of folk tales and then digitizing it using Google’s text recognition software for Optical Character Recognition (OCR). There was also a notable submission of Igbo names, a valuable resource that can be incorporated into the task of Named Entity Recognition. To learn more about other techniques being used to create datasets, be sure to check the November round-up here.

As we begin evaluation of the January submissions, we continue to be impressed by the calibre of datasets submitted and the effort put into their creation. 

This work actively challenges us to think more deeply about the copyright implications of some of these data collection sources and processes, and about how all this data will eventually be made open, in addition to the choice of dataset to use for a machine learning task in the second phase of this challenge, as each month brings us closer to the end of the dataset creation phase.

Contribution by:
Kathleen Siminyu, AI4D-Africa Network Coordinator
Sackey Freshia, Jomo Kenyatta University of Agriculture and Technology
Daouda Tandiang Djiba, GalsenAI



Delmiro Fernandez-Reyes from UCL on how AI can deliver better medicines in Africa

Delmiro Fernandez-Reyes, University College London, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on at the moment?

I’m based at the Department of Computer Science, University College London, as well as at the College of Medicine at the University of Ibadan. My work is related to solutions for global health challenges such as paediatric infections, malaria, and communicable or noncommunicable diseases.

The work has basically been harnessing the algorithms we develop to look at data that can improve diagnostics, improve clinical pathways, or make decisions faster, therefore creating savings for healthcare systems which are stretched.

So basically, we focus on challenges within these global health problems. What I do at the moment is develop the hardware that the AI is going to work on: we develop the microscope itself, which has a lot of AI components for diagnostics, such as navigation and detection of specific objects like malaria parasites, and all the etymological aspects of malaria screening.

Another important part of what we do: the role of AI, as I see it as a person who works on health challenges in the region, is that in Africa it is more transformative because it creates opportunity. For example, these projects, the ones I’m talking about, are already running; they are generating employment, they are generating teams. This is now being developed to use the technology on the front line.

We have a tool that improves MRI resolution and that is now being used by radiologists in Nigeria. Through those tools you can train people and professionals and increase interdisciplinarity, so it opens up opportunity, which is the opposite of what you see in the northern countries or in the West, where AI seems to take jobs away from people or do their tasks. I think in Africa you can use it on challenges that will increase the development of the region.

How do you perceive development and Artificial Intelligence?

The way to facilitate development is by focusing on the challenges the region has. The region has many challenges, from technological gaps to those of governance.

I want to focus on the ones closest to me, because of my background as a basic scientist in medicine and computer science. In those areas, we can clearly see that we can help with improving the key drivers of the lack of development, which are inequality, neonatal mortality and maternal mortality. Those are actually three axes that drive the region.

The region still has too many communicable diseases, HIV, tuberculosis, malaria; those are now the challenge. Another challenge is that, as people are getting older in southern Africa and in countries like Nigeria, lifespan is increasing with the increase in GDP, and you will have a bigger impact from noncommunicable diseases.

For those, I think we can bring a lot to management, healthcare systems, policy-making and strategies. Of course, there is another aspect to development: you cannot do this only for health; you have to develop power, infrastructure, and water and sanitation, so there needs to be a concerted element to this. You cannot have only the health people working alone; it has to be the infrastructure engineers at the same time, or telecommunications.

What is your blue sky project in Africa?

The main project we will focus on is what we are already doing. We would like to have an AI-driven platform for fast diagnosis of diseases in clinical labs. You can achieve that.

November Review: AI4D African Language Dataset Challenge

On the 1st of November, we launched the AI4D African Language Dataset Challenge on Zindi, an effort towards incentivizing the uncovering and creation of African language datasets for improved representation in NLP. This first phase of what is expected to be a two-phase challenge is taking place over 5 months, November 2019 to March 2020, with evaluation of submissions done on a monthly basis. Each month, the top 2 submissions will receive a cash prize of USD 500.

Being well into December, we are excited to announce that the top two submissions for November were received from:

  • Oshingbesan Adebayo, who submitted a dataset composed of three West African indigenous languages (Hausa, Igbo and Yoruba). The dataset was acquired from a wide variety of sources, ranging from transcriptions of songs, online news sites, excerpts from published books and websites in indigenous languages to blogs, Twitter, Facebook and more.
  • Thierno Diop, who submitted an Automatic Speech Recognition dataset for Wolof in the domain of transportation services. The data was prepared through a collaboration between BAAMTU Datamation, a Senegalese company focused on using data to help companies leverage AI and Big Data, and WeeGo, an app which helps passengers get information about urban transport in Senegal.

Overall, we received 9 submissions in the month of November, composed of data from a total of 4 unique languages. These are Hausa, Igbo, Wolof and Yoruba.

The majority of the data came from online sources. Scraping newspaper sites such as the BBC, DW and VOA, which curate news in several African languages, emerged as one of the top ways that participants went about creating datasets (a brief scraping sketch follows the list of sites below). A great strategy for putting together a sizeable dataset over the coming months would be to keep going back to the site(s) every so often and to keep your dataset up to date with the site, as news is regularly published. Capturing a wide variety of news categories would go a long way towards ensuring the dataset is well balanced and representative of language variety. Wikipedia sites published in various languages also featured as a data source.

  • BBC publishes news in Afaan Oromoo, Amharic, Hausa, Igbo, Kirundi, Pidgin, Somali, Swahili, Tigrinya and Yoruba 
  • DW publishes news in Amharic, Hausa and Kiswahili 
  • VOA publishes news in Afaan Oromoo, Amharic, Bambara, Hausa, Kinyarwanda/Kirundi, Ndebele, Shona, Somali, Kiswahili and Tigrinya
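
As mentioned above, here is a minimal scraping sketch of the kind participants used. The URL and the HTML structure assumed here are illustrative only; a real scraper must be adapted to each site's layout and must respect its terms of use.

```python
# Illustrative scraping sketch; the URL and HTML structure are assumptions,
# and each site's layout and terms of use must be checked before collecting text.
import requests
from bs4 import BeautifulSoup

def fetch_headlines(url: str) -> list[str]:
    """Download a news index page and return the visible headline text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all(["h2", "h3"])]

# Hypothetical example: the front page of a Yoruba-language news service.
for headline in fetch_headlines("https://www.bbc.com/yoruba"):
    print(headline)
```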

A closely related online source is Twitter data, which we have seen particularly curated for the task of sentiment analysis. A good place to start would be the accompanying Twitter profiles of the above news sites. While we haven’t had any data sourced from Facebook yet, I imagine that the profiles maintained by these news outlets for various languages would also be a good place to start.  

Manual translation also emerged, with some submissions compiled as a result of one or several individuals coming together to translate pieces of text, as well as custom applications, such as mobile apps, being used to crowdsource voice recordings for the Automatic Speech Recognition dataset.

I am also excited to announce that we will have a workshop at ICLR 2020, “AfricaNLP – Unlocking Local Languages”, which will be held in Addis Ababa in April of next year.
Part of the agenda of this workshop is set aside to showcase exceptional work and resulting datasets that will emerge as output from this exercise.

We will also use the workshop as an opportunity to launch the second phase of this challenge. If you have been following our thought process since the beginning, then you will know that the second phase of the challenge is largely dependent on the outcomes of this first phase. The one(or hopefully two) downstream NLP tasks that will be the object of the 2nd phase will utilise datasets that result from this first phase.

Finally, we have a Call for Papers for the workshop, specifically for research work involving African languages. Feel free to start making your submissions on this page. Here are some key dates to keep in mind:

  • Submission deadline: 1st February, 2020
  • Notification to authors: 26th February, 2020
  • Workshop: 26th April, 2020

Happy Holidays!

Contribution by:
Kathleen Siminyu, AI4D-Africa Network Coordinator
Sackey Freshia, Jomo Kenyatta University of Agriculture and Technology
Daouda Tandiang Djiba, GalsenAI



Vukosi Marivate from the University of Pretoria on Africa’s position in AI

Play video by Vukosi Marivate, University of Pretoria / CSIR, Deep Learning Indaba, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on at the moment?

I am Doctor Vukosi Marivate. I hold a chair in data science at the University of Pretoria in South Africa, and I am also here representing the Deep Learning Indaba. My work mostly involves machine learning and natural language processing, as well as how we use data science for society.

How do you perceive development and Artificial Intelligence?

I see AI as a tool that we can use in society, so I am not restricting it to development. On the continent, I believe that we all have our own challenges, no matter where you are, and the question is how we can use AI as one of the tools to improve the lives of Africans. If we start from there, all the other things follow.

What is your blue sky project in Africa?

As Africa, I think we are in an interesting position when we’re trying to look at AI and how it can be used. One of the things that becomes important is demystifying it for the public and decision-makers. I think the blue sky is how we get AI to be interpretable and transparent. That’s one big part; there should be more work done on that. It’s great having very accurate models, high accuracy, low error, but how then does somebody else interpret what is going on and understand it? Because I think that is where a lot of the bias creeps in: things are used without it being understood why they work the way that they do.

How do you feel about the workshop?

The workshop has been great; it has really been about meeting a lot of great minds from across the continent and beyond. I am looking forward to seeing what we do with the network after this.

Short one-liner if you have one?

Okay. For what? For the workshop? Just a slogan. Oh, we said we need to capacitate AI strength on the African continent through our communities.

 

Benjamin Rosman from University of the Witwatersrand on AI and development

Benjamin Rosman, University of the Witwatersrand / CSIR, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on at the moment?

I am Benjamin Rosman. I work at the University of the Witwatersrand in Johannesburg, South Africa. I also work at the CSIR, which is the Council for Scientific and Industrial Research in South Africa. And then, wearing another hat, I am also one of the founders and organizers of the Deep Learning Indaba.

In my research lab, which is based mainly at the University of the Witwatersrand, we focus mainly on questions around machine learning and decision theory, so we work predominantly in reinforcement learning and deep learning and areas around those, and we recently started working in applied areas as well.

How do you perceive development and Artificial Intelligence?

I think that the combination of AI and development is an interesting one. I think AI provides an opportunity to solve a lot of problems in the developing world, as it does currently around the world in general.

I think there are a lot of opportunities for students and society more generally to get involved in acquiring these tools, which can be used in a wide variety of industries and application areas. If we think about this right, there is an opportunity to make a large impact and train a lot of people in very impactful areas.

What is your blue sky project in Africa?

There are so many research topics that I would love to work on, but what I would really love to see is a pipeline from fundamental research in Africa through to applied research, considering aspects of ethics and society, and finally running all the way through to commercialization, so that we can be training academics, educating society in general, starting start-ups and improving the way that large corporates and governments work across the continent.

 

AI4D – African Language Dataset Challenge

NLP Challenge

Getting started with programming is easy, a well-trodden path. Whether it be picking up the skill itself, a new programming language or venturing into a new domain, like Natural Language Processing (NLP), you can be sure that a variety of beginner tutorials exist to get you started. The ‘Hello World!’s, as you may know them. 

Where NLP is concerned, some paths tend to be better trodden than others. It is infinitely easier to accomplish an NLP task, say Sentiment Analysis, in English than it is to do the same in my mother tongue, Luhya. This reality is an extrapolation of the fact that the languages of the digital economy are major European languages.

The gap between languages with plenty of data available on the Internet and those without is ever increasing. Pre-trained language models have in recent times led to significant improvements in various NLP tasks, and Transfer Learning is rapidly changing the field. While leading architectures for pre-training models for Transfer Learning in NLP are freely available for use, most are data-hungry. The GPT-2 model, for instance, used millions, possibly billions, of texts to train. (ref)

The only way I know how to begin closing this gap is by creating, uncovering and collating datasets for low resource languages. With the AI4D – African Language Dataset Challenge, we want to spur on some groundwork. While Deep Learning techniques now make it possible to dream of a future where NLP researchers and practitioners on the continent can easily innovate in the languages their communities speak, a future where literacy and mastery of a major European language is no longer a prerequisite to participation in the digital economy, these techniques require data. Data that can only be created by the communities that speak these languages, by individuals that have the technical skills, by those of us who understand the importance of this work and have the desire to undertake it.

The challenge will run for 5 months (November 2019 to March 2020), with cash prizes of USD 500 awarded as an incentive to the top 2 submissions each month. This is the first of a two-phase challenge; the first phase is the creation of datasets. We would like to see some of these datasets developed for specific downstream tasks, but this is not necessary.

We have, however, earmarked four downstream NLP tasks and anticipate that one (or two) of these will frame the second phase of this challenge: Sentence Classification, Sentiment Analysis, Question Answering and Machine Translation. Other downstream tasks that participants may be interested in developing datasets for, or have already developed datasets for, are also eligible. Our intention is that the datasets be kept free and open for public use under a Creative Commons license once the challenge is complete.

The challenge is hosted on Zindi; head on over to this page for full details. The prize money is provided through a partnership between the International Development Research Centre (IDRC) and the Swedish International Development Cooperation Agency (SIDA), and the challenge is facilitated through the combined efforts of the Artificial Intelligence for Development Network (AI4D-Africa) and the Knowledge 4 All Foundation (K4All). Finally, our expert panel has volunteered its time to undertake the difficult qualitative aspect of dataset assessment: Jade Abbott – RetroRabbit, John Quinn – Google AI/Makerere University, Kathleen Siminyu – AI4D-Africa, Veselin Stoyanov – Facebook AI and Vukosi Marivate – University of Pretoria.

The rest, we leave up to the community.  

Contribution by Kathleen Siminyu, AI4D-Africa Network Coordinator

Photo by Eva Blue on Unsplash.



Isaac Rutenberg, Strathmore University on development of AI in Africa

Isaac Rutenberg, Strathmore University, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on?

My name is Isaac Rutenberg, I am the director of the Centre for Intellectual Property and Information Technology Law (CIPIT) at the Strathmore Law School, here in Nairobi, Kenya. We are working at the intersection of intellectual property and IT, particularly in the ways that people utilize both of those for various reasons, including development.

How do you perceive development and Artificial Intelligence?

I think at the moment it is quite early; there are some very nascent projects in AI on the continent, and there are actually quite a lot of them. I think that the impact of those so far has been quite minimal.

I think that we are at an early stage of determining how we want to use AI. In some ways that is really good, because the rest of the world has shown us, or has allowed us to see some of the pitfalls, some of the major problems that we are going to encounter as we develop AI, we will encounter that in everyday life on a regular basis. We do already in some instances but it’s only going to grow.

What is your blue sky project in Africa?

If I could have AI solve any problem, it would be getting products to international markets. A lot of agriculture in Africa is wasted for a variety of reasons. I know a lot of those are structural and AI is obviously not going to solve all of them, but somehow if we could use AI to help with the distribution systems, with the analysis of all the data that is required or generated and that impacts how products are moved around, I think that would have a very big impact on people in their daily lives.

 

Kathleen Siminyu from Africa’s Talking on women in African AI

Play the video by Kathleen Siminyu, Africa’s Talking at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on?

My name is Kathleen Siminyu, my background is in math and computer science, and from there I’ve moved into data science. So, I am a data scientist at a company called Africa’s Talking. That is the job that pays the bills, but I wear a couple of other hats. I do a lot of work on building machine learning communities, so I run the Nairobi Women in Machine Learning and Data Science community here in Nairobi. I also work with the Deep Learning Indaba, which is a wider organization that works with communities across the continent.

How do you perceive development and Artificial Intelligence?

I think particularly in Africa we have a lot of problems, so there is a lot of development to be done. There are the routes that have been set, like industrialization is how countries come up, then AI brings a whole other aspect, which is how we’ve ended up in this age of Artificial Intelligence.

I think it gives us opportunities to transform a lot of things, and not necessarily follow the path set out by how other societies and countries have come up. I am really excited about AI and development. I think the fact that there is a need for development makes AI even more exciting for us to be applying.

What is your blue sky project in Africa?

Well, my pet project at the moment is NLP, so I am just going to go with that. The reason I think NLP for African languages is very important is that it gives you the ability to reach the individual. I could be here with all my English, but I am not the average African.

The average African is in a village somewhere and they speak their mother tongue and they can communicate and they can function in their life with that. But this average African is not able to participate in the digital economy. And it is not because they are stupid, they may be illiterate, but they can speak, they can understand, they can think.

If we could just talk to them then I think the opportunities are limitless. It’s not that the technology does not exist, because it does. We have Siri, and you give it a command, you ask it a question and it answers you. The technology exists, we just need it to be applied in this context. Once we have that, then we can go into healthcare, we can go into education, we can go into agriculture. So much opportunity.

I think language is the first thing which we need to crack. So, I’d say, let’s unlock language and then unlock Africa.

 

Prateek Sibal from UNESCO on policy in Artificial Intelligence

Prateek Sibal, UNESCO, at the workshop “Toward a Network of Excellence in Artificial Intelligence for Development (AI4D) in sub-Saharan Africa”, Nairobi, Kenya, April 2019

What are you working on at the moment?

My name is Prateek, I am a policy researcher; I studied economics and public policy and now I work at the intersection of technology, policy and society. Some of the things that we are trying to understand are how technology is influencing human rights, access to information and openness of information, and how the governance of AI and other emerging technologies is changing around the world.

How do you perceive development and Artificial Intelligence?

I think it’s rather interesting the way you put it, how AI is powering development and how development is using AI; I think it goes both ways. But at the heart of the issue is people. We have to be cognizant that there is a significant digital divide in this world and there are a lot of people who are still not online.

Even as we talk about development in the discussions that we had today, there are so many issues that emerge, we talk about online learning, but the internet is so expensive in some countries, so they have to use WhatsApp.

There are very fundamental challenges in development that we need to address, along with communities and being informed by their way of doing things. I think that is super important as we go ahead framing the AI and development agenda.

What is your blue sky project in Africa?

I think human capacity is something which I believe we all need to focus on, and there is so much need in developing countries to bridge that divide, to build capacity, to help governments shape policies and to support research centres, and it will have a ripple effect. This is something which cannot happen overnight, and hence building and developing capacities is the key, I think, if we are to go forward with this.