Introduction

Given the challenges and precariousness facing developing and underdeveloped countries, the quality of policy-making and legislation is of enormous importance. Legislation can influence progress toward several of the United Nations Sustainable Development Goals (SDGs), such as poverty alleviation, a good public health system, quality education, economic growth, and sustainability. Targets 16.6 and 16.7 of the UN SDGs are to “develop effective, accountable, and transparent institutions at all levels” and to “ensure responsive, inclusive, participatory and representative decision making at all levels” [2]. For countries in Sub-Saharan Africa to meet these targets, an open data revolution needs to happen at all levels of government and, more importantly, at the parliamentary level.

Objectives and Expectations

To meet UN SDG targets 16.6 and 16.7, making effective use of data is key. However, does such data currently exist? If so, how should it be organized in a framework that is amenable to the decision-making process? Here, we propose expanding our work on categorizing parliamentary bills in Nigeria using Optical Character Recognition (OCR), document embedding, and recurrent neural networks to three other countries in Africa: Kenya, Ghana, and South Africa. We also plan to improve our text extraction process by training a custom OCR model. The objective of this project is to generate semantic and structured data from the bills and, in turn, categorize them under socio-economic labels. We plan to recruit three interns to work on this project for five months: two machine learning interns and one software engineering intern. We have the following expectations and deliverables for this project:

  • An interactive web application that contains all parliamentary bills. It will be publicly available. 
  • Data and code for the project.
  • A manuscript for submission to the United Nations' AI for Good Global Summit (2020).

Preliminary Results

We are currently preparing a manuscript based on our initial results, and we plan to submit it to the Machine Learning for the Developing World (ML4D) workshop at the 2019 Neural Information Processing Systems conference (O. Wahab and A. Akinfaderin. NASS-AI: Towards Digitization of Parliamentary Bills using Document Level Embedding and Bidirectional Long Short-Term Memory. Under preparation). In most developed countries, legal texts, parliamentary bills, and court documents are available as structured corpora [5]. However, due to inefficiency and the lack of a proper data management system, such structured corpora are difficult to obtain in underdeveloped and some developing countries. Researchers in developing countries such as Brazil recently used machine learning to classify legal documents, which helps unclog the overloaded judicial system [3]. In this project, we present results from ongoing research on the categorization of bills introduced in the Nigerian parliament since the start of the Fourth Republic (1999–2018). For this task, we employed a multi-step approach that involves extracting text from low-quality scanned PDFs using OCR tools and labeling the bills into eight categories. We investigate the performance of document-level embedding [6] for feature representation of the extracted texts before using a Bidirectional Long Short-Term Memory (Bi-LSTM) network as our classifier [4]. The performance was further compared with other feature representations and machine learning techniques.

Using our scraped dataset, we hand-labeled each bill with a category. We started with 18 classes before reducing the number to eight, which lets us capture a compact but accurate representation of the different facets of governance. The eight classification groups, shown with their class frequencies in Figure 1, are: laws, civil rights, safety and security; education, research and technology; government operation and international affairs; health and agriculture; labour, sports and social welfare; trade, commerce and macroeconomics; energy, environment and natural resources; and public land, housing and transportation. Table 1 reports the precision, recall, and F1-score for each class from our experiment. Figure 2 presents our model architecture and the normalized confusion matrix of actual versus predicted labels. Our model performed very well on the health and agriculture category but least well on the public land, housing and transportation category. The overall macro-average F1-score on the test set is 71%. In addition, we used other feature representation methods and machine learning techniques as baselines. Our experiments show that document-level embedding with a Bi-LSTM classifier outperforms the other methods.
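
As an illustration of the evaluation quantities mentioned above, the following minimal sketch shows how a row-normalized confusion matrix and a macro-average F1-score can be derived from actual and predicted labels. It is not the project's evaluation code: the three label names and the toy predictions are invented for the example, and all function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix: rows are actual labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so every actual class sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def per_class_scores(matrix):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(matrix)
    scores = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column total
        actual_k = sum(matrix[k])                          # row total
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


def macro_f1(matrix):
    """Unweighted mean of the per-class F1-scores."""
    scores = per_class_scores(matrix)
    return sum(f1 for _, _, f1 in scores) / len(scores)


# Toy data, invented for illustration only (not the NASS-AI results).
labels = ["health", "education", "trade"]
y_true = ["health", "health", "education", "education", "trade", "trade"]
y_pred = ["health", "health", "education", "trade", "trade", "education"]

counts = confusion_matrix(y_true, y_pred, labels)
print(normalize_rows(counts))
print(round(macro_f1(counts), 3))
```

Because the macro average weights every class equally, a weak class with few bills lowers the overall score as much as a weak class with many bills, which is why it is a reasonable summary for imbalanced categories.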

Conclusions and Long Term Vision

Our initial experimental results show that our model is effective for categorizing the bills, which will aid our large-scale digitization efforts. However, our results point to a key remaining challenge: the output of the pre-trained OCR tool is often not an accurate representation of the text in the bills, especially for low-quality PDFs. We propose to address this by training a custom OCR model. The rapid progress of text detection research with deep learning methods can help us in this area; methods such as region-based or single-shot detectors can be employed. In addition, we plan to use image augmentation to alter the size, background noise, or color of the bill images. A large-scale annotation effort can provide the labels needed to train our custom OCR model for text identification and named entity recognition. We are also extending our methodology to other countries in Sub-Saharan Africa.
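
The kind of image augmentation described above could look roughly like the sketch below. It is only an illustration under stated assumptions: a page is represented as a small grayscale grid of 0–255 values, a brightness shift stands in for a color change, and the scale factors, noise rate, and function names are all hypothetical rather than choices made in this project.

```python
import random


def resize_nearest(image, new_height, new_width):
    """Resize a grayscale image (list of rows) by nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // new_height][c * width // new_width]
         for c in range(new_width)]
        for r in range(new_height)
    ]


def add_background_noise(image, amount, rng):
    """Replace roughly a fraction `amount` of pixels with a random grey level."""
    return [
        [rng.randint(0, 255) if rng.random() < amount else pixel
         for pixel in row]
        for row in image
    ]


def shift_brightness(image, delta):
    """Add `delta` to every pixel, clamped to the 0-255 range."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]


def augment(image, rng):
    """Apply one random size, noise, and intensity change to a page image."""
    scale = rng.choice([0.5, 1.0, 2.0])  # hypothetical scale factors
    new_h = max(1, int(len(image) * scale))
    new_w = max(1, int(len(image[0]) * scale))
    out = resize_nearest(image, new_h, new_w)
    out = add_background_noise(out, amount=0.05, rng=rng)  # assumed noise rate
    return shift_brightness(out, delta=rng.randint(-40, 40))


# A tiny 4x4 "page": white background (255) with a dark stroke (0).
page = [
    [255, 255, 255, 255],
    [255, 0, 0, 255],
    [255, 0, 0, 255],
    [255, 255, 255, 255],
]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
variants = [augment(page, rng) for _ in range(3)]
```

Each call to `augment` yields a differently sized, noisier, lighter or darker copy of the same page, so one annotated scan can serve as several training examples.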

Accurate categorization of parliamentary bills is well positioned to have a substantial impact on governmental policies and on the efforts of governments in low-resource countries to meet the Open Data Charter principles and the United Nations' Sustainable Development Goals on open government. It can also empower policymakers, stakeholders, and governmental institutions to identify and monitor bills introduced to the National Assembly for research purposes, and improve the efficiency of bill creation and open data initiatives. We plan to design an intercontinental tool that combines information from all bills and categories and makes it easily accessible to everyone. For our long-term vision, we plan to analyze documents on parliamentary votes and proceedings to gain more insight into legislative debates and patterns.

References

[1] Open Data Barometer by the World Wide Web Foundation. https://opendatabarometer.org/4thedition/regional-snapshot/sub-saharan-africa/.

[2] United Nations Sustainable Development Goals. https://sustainabledevelopment.un.org/sdg16.

[3] F. A. Braz, N. C. Silva, T. E. de Campos, et al. Document classification using a Bi-LSTM to unclog Brazil’s Supreme Court. NeurIPS 2018 Workshop on Machine Learning for the Developing World (ML4D), Montréal, Québec, 2018.

[4] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:602–610, 2005.

[5] A-L. Kalouli, L. Vrana, V. M. Fabella, L. Bellani, and A. Hautli-Janisz. Cousbi: A structured and visualized legal corpus of US state bills. Proceedings of the LREC 2018 “Workshop on Language Resources and Technologies for the Legal Knowledge Graph”, Miyazaki, Japan, 2018.

[6] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, pages 1188–1196, 2014.