Visualizing Covid Vaccinations - Python data prep and steps
We want to plot Covid vaccination rates across countries worldwide or across states in the USA. To do that, we need a standardized dataset that is accurate enough for our graphing purposes.
The folks at Our World in Data (OWiD) gather that information to create composite data sets. Each independent entity reports data on its own schedule. The composite dataset can be missing entire days of data for some entities or individual data attributes in some of the days that are actually reported.
Let's look at the steps required to create reasonable comparisons and progress graphics.
Source Data and Code
- Dataset courtesy of Our World in Data: GitHub Repository
- Python code and scripts described here are available on GitHub
- Video links are at the bottom of this article
Data Consistency
We want time-series data that lets us exactly line up the data for each reporter.
This table shows two different countries, C1 and C2. They each report data on their own schedules.
C1 does not report all values on the days that it turns in data. There may also be a reporting lag: C1 has not yet reported the last day's data.
C2 reports every value on the days that it reports, but some days are missing entirely.
Missing rows fall into three groups, depending on where they sit in the time series:
- Interior Rows sit between other rows when sorted by time.
- Initial Rows are at the start of the time series. They may be missing for the period before an entity started providing data.
- Final (Trailing) Rows are at the end of the data set. An entity may be missing some final rows because of reporting lag or other transmission issues.
We need strategies for filling in the missing constants and numerical values at the start, end, or in the middle of a time series. The data needs to be good enough to make a smooth animated plot.
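Before filling anything in, it helps to line every entity up on the same calendar so that the gaps become visible. The following is a minimal sketch of that idea with pandas, not the code from the repository. The sample values are made up, and the column names `location`, `date`, and `total_vaccinations` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample: two entities reporting on different schedules.
raw = pd.DataFrame(
    {
        "location": ["C1", "C1", "C2", "C2"],
        "date": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-03"]
        ),
        "total_vaccinations": [10.0, 20.0, 5.0, 15.0],
    }
)

# One shared calendar covering the earliest to the latest reported day.
all_days = pd.date_range(raw["date"].min(), raw["date"].max(), freq="D")

# Every (entity, day) combination, whether or not it was reported.
grid = pd.MultiIndex.from_product(
    [raw["location"].unique(), all_days], names=["location", "date"]
)

# Reindexing creates the missing rows and marks their values as NaN.
aligned = raw.set_index(["location", "date"]).reindex(grid)
missing = aligned["total_vaccinations"].isna()
print(aligned)
```

In this sample, C1 ends up with a missing trailing day and C2 with a missing interior day, which are two of the cases handled below.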
Interior Rows
- Create new time series "rows" for each missing day.
- Interpolate the numerical values based on those on both sides of the new row.
- Forward fill any constants like reporting agency, country, or source web site. They tend to be the same across most of the time series and don't affect the plot.
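The interior-row steps above can be sketched with pandas as follows. This is an illustration under assumptions, not necessarily how the repository code does it: the data is invented, the column names are hypothetical, and linear interpolation is one reasonable choice among several.

```python
import pandas as pd

# Hypothetical reports for one entity; 2021-01-03 and 2021-01-04 are missing.
raw = pd.DataFrame(
    {
        "location": ["C2", "C2", "C2"],
        "total_vaccinations": [100.0, 200.0, 500.0],
    },
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
)

# Create a row for every calendar day between the first and last report.
full_days = pd.date_range(raw.index.min(), raw.index.max(), freq="D")
interior = raw.reindex(full_days)

# Interpolate numeric values from the reported rows on both sides.
# limit_area="inside" restricts filling to gaps surrounded by real values.
interior["total_vaccinations"] = interior["total_vaccinations"].interpolate(
    limit_area="inside"
)

# Forward fill constants such as the entity name.
interior["location"] = interior["location"].ffill()
print(interior)
```

Because the reindexed rows are evenly spaced by day, the default linear method spreads the change between two reports evenly across the missing days.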
Initial Rows
- Create new time series rows for each missing day. Parts of these rows will be incomplete.
- Missing numerical values could be extrapolated backward from the following rows. Instead, we leave these values as NaN and exclude them from the plot.
- Missing constant or reference data can be back-filled from the first populated row. It doesn't affect the plots and may be used as part of the legend. bfill is good enough for these plots.
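A minimal sketch of the initial-row handling, assuming pandas and a shared start date for all entities. The start date, the sample values, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical entity that only starts reporting on 2021-01-03.
raw = pd.DataFrame(
    {"location": ["C1", "C1"], "people_vaccinated": [50.0, 80.0]},
    index=pd.to_datetime(["2021-01-03", "2021-01-04"]),
)

# Assumed common calendar that starts before this entity's first report.
common_days = pd.date_range("2021-01-01", "2021-01-04", freq="D")
padded = raw.reindex(common_days)

# Back-fill constants from the first populated row.
padded["location"] = padded["location"].bfill()

# Numeric values stay NaN in the new initial rows; drop them when plotting.
plottable = padded.dropna(subset=["people_vaccinated"])
print(padded)
```

Leaving the numeric values as NaN avoids inventing vaccination counts for days before the entity reported anything.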
Trailing Rows
- Create new time series "rows" for each missing trailing day. Parts of these rows will be incomplete.
- Leave the missing numerical values in the new rows as they are. We could extrapolate forward from previous data but did not do that here.
- Forward fill any constants like reporting agency, country, or source web site. They tend to be the same across time and do not affect the plot.
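The trailing-row handling can be sketched the same way. One detail worth noting if you use pandas: a plain `interpolate()` call also fills NaN values after the last report by repeating the last valid value, so this sketch passes `limit_area="inside"` to keep trailing values missing. The data, end date, and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical entity whose last report is on 2021-01-04.
raw = pd.DataFrame(
    {"location": ["C1", "C1", "C1"], "total_vaccinations": [10.0, 20.0, 40.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-04"]),
)

# Assumed common calendar that runs two days past this entity's last report.
common_days = pd.date_range("2021-01-01", "2021-01-06", freq="D")
padded = raw.reindex(common_days)

# Forward fill constants into the interior and trailing rows.
padded["location"] = padded["location"].ffill()

# Fill only interior gaps; trailing rows keep NaN for the numeric column.
padded["total_vaccinations"] = padded["total_vaccinations"].interpolate(
    limit_area="inside"
)
print(padded)
```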
Process Flow
The Python code operates in the following sequence.
Video Walkthroughs
Data Preparation - Fixing up the data for visualization.
Python Code Walkthrough GitHub
Comparing nation performance.
Comparing state performance.
Python Source on GitHub
Python source code is available on GitHub. The repository contains a shell script that downloads the latest data and then starts a docker container that mounts the scripts and data into a Jupyter Notebook.