This research focuses on enhancing Pharmacovigilance Systems using Natural Language Processing on Electronic Medical Records (EMR). Our major task was to develop an NLP model for extracting Adverse Drug Reaction(ADR) cases from EMR. The team was required to collect data from two hospitals, which are using EMR systems (i.e. University of Dodoma (UDOM) Hospital and Benjamin Mkapa (BM) Hospital). During data collection and analysis, we worked with health professionals from the two mentioned hospitals in Dodoma. We also used the public dataset from the MIMIC-III database. These datasets were presented in different formats, CSV for UDOM hospital and MIMIC III and PDF for BM hospital as shown on the attached file.
In most cases, pharmacovigilance practices depend on analyzing clinical trials, biomedical writing, observational examinations, Electronic Health Records (EHRs), Spontaneous Reporting (SR) and social media (Harpaz et al., 2014). As to our context, we considered EMR to be more informative compared to other practices, as suggested by (Luo et al., 2017). We studied schemas of EMRs from the two hospitals. We collected inpatients’ data since outpatients’ would have given the incomplete patient history. Also, our health information systems are not integrated, which makes it difficult to track patients’ full history unless patients were admitted to a particular hospital for a while. From all the data sources used there was a pattern of information that we were looking for, and this included clinical history, prior patient history, symptoms developed, allergies/ ADRs discovered during medication and patient’s discharge summary.
Much as we worked on UDOM and BM hospitals’ data, we encountered several challenges that made the team focus on MIMIC-III dataset while searching for an alternative way to our local data. Here were the challenges noted:
- The reports had no clear identification of ADR cases.
- In most cases, the doctor did not mention the reasons for changing a medicine on a particular patient which made it hard to understand whether the medication didn’t work well for a specific patient or any other reasons like adverse reaction.
- The justification for ADR cases was vague
- Mismatch of information between patients and doctors
- The patients talk in a way that doctor can’t understand
- There is a considerable gap between the health workers and regulatory authorities (They don’t know if they have to report for ADR cases)
- The issue of ADR is so complex since there is a lot to take into account like Drug to Drug, Drug to food and Drug to herbal interactions.
- There was no common/consistent reporting style among doctors
- The language used to report is hard for a non-specialist to understand.
- Some fields were left empty with no single information which led to incomplete medical history
- The annotation process prolonged since we had one pharmacologist for the work.
After noticing all these challenges, the team carefully studied the MIMIC-III database to assess the availability of the data with ADR cases which would help to come up with the baseline model to the problem. We discovered that the NoteEvent table has enough information about the patient history with all clear indications of ADR cases and with no ADR see the text.
To start with, we were able to query 100,000 records from the database with many attributes, but we used a text column found in the NoteEvent table with the entire patient’s history including (patient’s prior history, medication, dosage, examination, changes noted during medications, symptoms etc.). We started the annotation of the first group by filtering the records to remain with the rows of interest. We used the following keywords in the search; adverse, reaction, adverse events, adverse reaction and reactions. We discovered that only 3446 rows contain words that guided the team in the labelling process. The records were then annotated with the labels 1 and 0 for ADR and non-ADR cases respectively, as indicated in the filtration notebook.
In analysing the data, we found that there were more non-ADR cases than ADR cases, in which non-ADR cases were 3199 and 228 ADR cases and 19 data rows not annotated. Due to high data imbalance, we reduced Non-ADR cases to 1000, and we applied sampling techniques (i.e upsampling ADR cases to 800) to at least balance the classes to minimize bias during modelling.
After annotation and simple analysis we used NLTK to apply the basic preprocessing techniques for text corpus as follows:-
- Converting the corpus-raw sentences to lower cases which helps in other processing techniques like parsing.
- Sentence tokenization, due to the text being in paragraphs, we applied sentence boundary detection to segment text to sentence level by identifying sentence starting point and endpoint.
Then we worked with regular contextual expressions to extract information of interest from the documents by removing some of the unnecessary characters and replacing some with easily understandable statements or characters as for professional guidelines.
We removed affixes in tokens which put words/tokens into their root form. Also, we removed common words(stopwords) and applied lemmatization to identify the correct part of speech(s) in the raw text. After data preprocessing, we used Term Frequency Inverse Document Frequency (TF-IDF) from scikit-learn to vectorize the corpus, which also gives the best exact keywords in the corpus.
In modelling to create a baseline model, we worked with classification algorithms using scikit-learn. We trained six different models which are Support Vector Machines, eXtreme Gradient Boosting, Adaptive Gradient Boosting , Decision Trees, Multilayer Perceptron and Random Forest and then we selected three (Support Vector Machine, Multilayer Perceptron and Random Forest )models which performed better on validation compared to other models for further model evaluation. We’ll also use the deep learning approach in the next phase of the project to produce more promising results for the model to be deployed and kept in practice. Here is the link to colab for data pre-processing and modelling.
From the UDOM database, we collected a total of 41,847 patient records in chunks of 16185, 18400, and 7262 from 2017 to 2019 respectively. The dataset has following attributes (Date, Admission number, Patient Age, Sex, Height(Kg), Allergy status, Examination, Registration ID, Patient History, Diagnosis, and Medication ), we downsized it to 12,708 records by removing missing columns and uninformative rows. We used regular contextual expressions to extract information of interest from the documents as for professional guidelines. The data cleaned and exchanged data formatting, analyzing and preparing data for machine learning was elaborated in this Colab link.
On the BM hospital, the PDF files extracted from EMS have patient records with the following information.
- Discharge reports
- Medical notes
- Patients history
- Lab notes
Health professionals on the respective hospitals manually annotated the labels for each document, and this task took most of our time in this phase of the project. We’re still collecting and interpreting more data from these hospitals.
The team organizes and extracts information from BM hospital PDF files by exchanging data formatting, analyzing and preparing data for machine learning. We experimented with OCR processing for PDF files to extract data, but we didn’t generate promising results as more information appeared to be missing. We therefore hard to programmatically remove content from individual files and align them to the corresponding professional provided labels.
The big lesson that we have learned up to now is that most of the data stored in our local systems are not informative. Policymakers must set standards to guide system developers during development and health practitioners when using the system.
Lastly but not least, we want to thank our stakeholders, mentors and funders for your involvement in our research activities. It is because of such a partnership we can be able to achieve our main goal.