Building a Data Pipeline for a Real World Machine Learning Application

We set out with a novel idea; to develop an application that would (i) collect an individual’s Blood Pressure (BP) and activity data, and (ii) make future BP predictions for the individual with this data.

Key requirements for this study therefore were;

  1. The ability to get the BP data from an individual.
  2. The ability to get a corresponding record of their activities for the BP readings.
  3. The identification of a suitable Machine Learning (ML) Algorithm for predicting future BP.

Pre-test the idea – Pre testing the idea was a critical first step in our process before we could proceed to collect the actual data. The data collection process would require the procurement of suitable smart watches and the development of a mobile application, both of which are time consuming and costly activities. At this point we learnt our first lessons; (i) there was no precedence to what we were attempting and subsequently (ii) there were no publicly available BP data sets available for use in pre-testing our ideas.

Simulate the test data – The implication therefore was that we had to simulate data based on the variables identified for our study. The variables utilized were the Systolic and Diastolic BP Reading, Activity and a timestamp. This was done using a spreadsheet and the data saved as a comma separate values (csv) file. The csv is a common file format for storing data in ML.

Identify a suitable ML model – The data simulated and that in the final study was going to be time series data. The need to predict both the Systolic and Diastolic BP using previous readings, activity and timestamps meant that we were was handling a multivariate time series data. We therefore tested and settled on an LSTM model for multivariate time series forecasting based on a guide by Dr Jason Browniee (https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/)

Develop the data collection infrastructure – There being no pre-existing data for the development implied that we had to collect our data. The unique nature of our study, collecting BP and activity data from individuals called for an innovative approach to the process.

  • BP data collection – for this aspect of the study we established that the best way to achieve this would be the use of smart watches with BP data collection and transmission capabilities. In addition to the BP data collection, another key consideration for the device selection was affordability. This was occasioned both by the circumstances of the study, limited resources available and more importantly, the context of use of a probable final solution; the watch would have to be affordable to allow for wide adoption of the solution.

The watch identified was the F1 Wristband Heart and Heart Rate Monitor.

  • Activity data collection – for this aspect of the study a mobile application was identified as the method of choice. The application was developed to be able to receive BP readings from the smart watch and to also collect activity data from the user.

Test the data collection – The smart watch – mobile app data collection was tested and a number of key observations were made.

  • Smart watch challenges – In as much as the watch identified is affordable it does not work well for dark skinned persons. This is a major challenge given the fact that a majority of people in Kenya, the location of the study and eventual system use, are dark skinned. As a result we are examining other options that may work in a universal sense.
  • Mobile app connectivity challenges – The app initially would not connect to the smart watch but this was resolved and the data collection is now possible.

Next Steps

  • Pilot the data collection – We are now working on piloting the solution with at least 10 people over a period of 2 – 3 weeks. This will give us an idea on how the final study will be carried out with respect to:
  1. How the respondents use the solution,
  2. The kind of data we will be able to actually get from the respondents
  3. The suitability of the data for the machine learning exercise.
  • Develop and Deploy the LSTM Model – We shall then develop the LSTM model and deploy it on the mobile device to examine the practicality of our proposed approach to BP prediction.