The Data Without Borders class was described as “Data science in the service of humanity.” It was led by the amazingly talented and enthusiastic Jake Porway, founder of DataKind. The main technical focus of the class was learning to analyze data in the R language, using the RStudio environment. DataKind connects data analysts and visualizers with not-for-profits that have a lot of data but don’t know what to do with it.
I took this class in large part because of my Databetes project and an interest in learning ways to analyze all the data I am accumulating. Quite obviously, I chose some of my medical data to analyze for the final project. I looked at one month’s worth of blood sugar readings from my Dexcom continuous glucose monitor (CGM). My goal was to look beyond the way this data is normally analyzed, which is based on the day’s average blood sugar. Instead, I wanted to look at the volatility of the readings. As a patient, I know there are days with great average blood sugar readings but big swings from high to low blood sugars. On those days, you definitely don’t feel like you are achieving good control, even though your average daily blood sugar reading says you are. This project gave me the opportunity to explore ways of spotting those days, measuring their volatility, and creating a metric for distinguishing acceptable from troubling volatility.
The problem, however, is that the CGM time series is often incomplete. The CGM is designed to generate a reading every 5 minutes, but the transmission range between the sensor/transmitter and the receiver is limited. If a reading is transmitted but not received, it is lost forever; this tends to happen several times a day. Occasionally the Dexcom also gives me a “???” error on the receiver, most often during the first day of using a sensor (sensors need to be replaced once a week). Finally, after inserting a new sensor there is a three-hour window during which the device does not transmit data.
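Dropouts like these can be located by checking the spacing between consecutive timestamps. A minimal sketch in R, assuming the readings sit in a data frame with a POSIXct `Time` column (the names `dexcom`, `Time`, and the 7-minute tolerance are my illustrative assumptions, not from the original script):

```r
# Minutes elapsed between consecutive readings
gapMinutes <- as.numeric(diff(dexcom$Time), units = "mins")

# Anything noticeably longer than the nominal 5-minute interval is a dropout;
# the 7-minute threshold leaves a little room for clock jitter.
gaps <- which(gapMinutes > 7)

# Table of where each gap starts and how long it lasts
data.frame(start = dexcom$Time[gaps], minutesMissing = gapMinutes[gaps])
```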
I began by trying to fill these holes in the data. The deeper I looked into the problem, however, the more issues appeared. Every time I put in a new sensor, it starts transmitting at a time out of sync with the previous sensor. For example, if the sensor for the first week of January transmits at ##:02:10 and ##:07:10, the sensor for the second week could transmit at ##:04:51 and ##:09:51. Since any analysis would need to specify the time series variable down to the second, I would need to round the timestamps in one direction or the other. I spoke with Jake about ways of doing this in R, and the conversation quickly became complicated. He was worried that I would spend so much time cleaning up the data that I would have far too little time left for the actual analysis.
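One way to put out-of-sync sensors on a common footing is to snap every timestamp to the nearest 5-minute mark. A sketch of that idea, again assuming a POSIXct `Time` column (function and column names are mine):

```r
# Round a POSIXct timestamp to the nearest 5 minutes (300 seconds)
roundTo5min <- function(t) {
  as.POSIXct(round(as.numeric(t) / 300) * 300,
             origin = "1970-01-01", tz = "UTC")
}

# ##:04:51 and ##:02:10 both land on the same 5-minute grid
dexcom$Time <- roundTo5min(dexcom$Time)
```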
As a solution, Jake suggested I find the biggest blocks of complete data and do the analysis within them. I found six blocks that had about a day’s worth of complete data each.
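The blocks themselves can be found by splitting the series at every gap and ranking the resulting runs by length. A sketch under the same assumptions as above (a `dexcom` data frame with a POSIXct `Time` column; names are illustrative):

```r
gapMins <- as.numeric(diff(dexcom$Time), units = "mins")
breaks  <- which(gapMins > 7)          # last index of each complete block

starts  <- c(1, breaks + 1)            # where each block begins
ends    <- c(breaks, nrow(dexcom))     # where each block ends
sizes   <- ends - starts + 1

best    <- order(sizes, decreasing = TRUE)[1:6]   # six largest blocks
blockA  <- dexcom[starts[best[1]]:ends[best[1]], ]
```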
I generated three reports for each of these data sets.
The first (top) charts the blood sugars directly. The second (bottom left) applies a slight smoothing to the readings using the TTR library.
library(TTR)  # provides SMA()
smoothedA <- SMA(dexcomA$Value)  # simple moving average; default window is 10 readings
The final (bottom right) chart gives the volatility that I was after. I experimented with different lag values and in the end settled on 10, which compares each reading with the one 10 points earlier (e.g., point 1 with point 11).
changesC <- diff(smoothedC, lag=10)  # change versus the reading 10 points (~50 minutes) earlier
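With the lagged differences in hand, a single number can summarize each block’s volatility. A sketch, reusing `changesC` from above (`na.rm = TRUE` drops the NAs that SMA() leaves at the start of the series):

```r
# Largest absolute swing over the ~50-minute comparison window
max(abs(changesC), na.rm = TRUE)

# A fuller picture of the block's volatility
summary(abs(changesC))
```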
The greatest volatility I saw in these readings was around 60 points within the hour time frame that I compared. In general, my January readings were quite good (for those of you familiar with diabetes, my HbA1C for this time period was 5.6), so this can be considered a reasonable level of volatility.
It seems that the best approach to this problem may involve using other programming languages to clean up the data and make it easier to assess volatility. This is a problem I will return to in the coming months as I prepare my thesis on Databetes. Nonetheless, this was an interesting start to understanding some of the issues behind data analysis. R is a nice break from programs like Processing because it can reach many different conclusions and generate several standard visualizations with very little code. I expect I will use it again in the future to first figure out what I want to visualize, then return to Processing to make a more compelling design.
For those interested, this is the full presentation I made on the final day of class.