Visualizing Covid Vaccinations - Python data prep and steps

March 14, 2021

We want to plot the Covid vaccination rates across different countries world-wide or different states in the USA. We need to create a standardized dataset that is accurate enough for our graphing purposes.

The folks at Our World in Data (OWiD) gather that information to create composite data sets. Each independent entity reports data on its own schedule. The composite dataset can be missing entire days of data for some entities, or individual data attributes on days that are reported.

Let's look at the steps required to create reasonable comparisons and progress graphics.

Source Data and Code

Dataset courtesy of Our World in Data: GitHub Repository
Python code and scripts described here are available on GitHub
Video links are at the bottom of this article

Data Consistency

We want time-series data that lets us exactly line up the data for each reporter.

This table shows two different countries, C1 and C2. They each report data on their own schedules.

C1 does not report all values on the days that it turns in data. There may also be some type of reporting lag: C1 has not reported the last day's data yet.

C2 reports all of the data when it actually reports. There are some days completely missing any information.
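Before filling anything in, it can help to measure what is missing. The sketch below is a minimal example and not the code from the repository. The sample rows are made up to mirror C1 and C2, and the column names (`location`, `date`, `total_vaccinations`) and the helper `missing_days` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample mirroring C1 and C2. Column names are assumptions.
raw = pd.DataFrame(
    {
        "location": ["C1", "C1", "C1", "C2", "C2", "C2"],
        "date": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03",
             "2021-01-01", "2021-01-03", "2021-01-04"]
        ),
        "total_vaccinations": [100.0, None, 300.0, 50.0, 150.0, 200.0],
    }
)


def missing_days(group: pd.DataFrame, last_day: pd.Timestamp) -> list:
    """Return the calendar days an entity did not report, up to last_day."""
    expected = pd.date_range(group["date"].min(), last_day, freq="D")
    return list(expected.difference(pd.DatetimeIndex(group["date"])))


# Use the latest date seen for any entity as the common end of the series.
last_day = raw["date"].max()

# Whole days with no row at all, per entity.
gaps = {name: missing_days(g, last_day) for name, g in raw.groupby("location")}

# Rows that exist but are missing an individual value, per entity.
missing_values = raw["total_vaccinations"].isna().groupby(raw["location"]).sum()
```

In this sample C1 has one reported row with a missing value and lags by one day at the end, while C2 is missing one interior day.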

Interior Rows sit between other rows when sorted by time.
Initial Rows are at the start of the time series. They may be missing for the period before an entity began providing data.
Trailing Rows are at the end of the data set. An entity may be missing some trailing rows because of reporting lag or other transmission issues.

We need strategies for filling in the missing constants and numerical values at the start, end, or in the middle of a time series. The data needs to be good enough to make a smooth animated plot.

The lower two tables show the data we will plot after filling in missing time series rows and individual attributes.

Interior Rows

Create new time series "rows" for each missing day.
Interpolate the numerical values based on those on both sides of the new row.
Forward fill any constants like reporting agency, country, or source web site. They tend to stay the same over time and don't affect the plot.
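The three interior-row steps above can be sketched with pandas as follows. This is an illustration under assumptions and not the repository code: the sample data and column names are made up, and linear interpolation is one reasonable choice rather than a confirmed one.

```python
import pandas as pd

# Hypothetical single-entity sample with Jan 2 missing. Column names are assumptions.
c2 = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-01", "2021-01-03", "2021-01-04"]),
        "location": ["C2", "C2", "C2"],
        "total_vaccinations": [50.0, 150.0, 200.0],
    }
).set_index("date")

# Step 1: create a row for every calendar day between the first and last report.
full_days = pd.date_range(c2.index.min(), c2.index.max(), freq="D")
filled = c2.reindex(full_days)

# Step 2: interpolate numeric values from the rows on both sides.
# limit_area="inside" restricts the fill to gaps surrounded by valid values.
filled["total_vaccinations"] = filled["total_vaccinations"].interpolate(
    method="linear", limit_area="inside"
)

# Step 3: forward fill constants such as the country name.
filled["location"] = filled["location"].ffill()
```

The new Jan 2 row ends up halfway between its neighbors, which is what keeps an animated plot smooth across the gap.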

Initial Rows

Create new time series rows for each missing day. Parts of these rows will be incomplete.
Some missing numerical values could be backward extrapolated from the following rows. Instead, we left these values as NaN and excluded them from the plot.
Missing constant or reference data can be back-filled from the first populated row. It doesn't affect the plots and may be used as part of the legend. bfill is good enough for these plots.
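A minimal sketch of the initial-row handling, assuming an entity that starts reporting two days after the plot begins. The sample data, the column names, and the `plot_start` variable are hypothetical.

```python
import pandas as pd

# Hypothetical entity that only begins reporting on Jan 3.
late = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-03", "2021-01-04"]),
        "location": ["C1", "C1"],
        "source_name": ["Ministry of Health", "Ministry of Health"],
        "total_vaccinations": [300.0, 400.0],
    }
).set_index("date")

# Assumed common start date shared by every entity in the plot.
plot_start = pd.Timestamp("2021-01-01")

# Create the missing initial rows. Every column in them starts out as NaN.
all_days = pd.date_range(plot_start, late.index.max(), freq="D")
padded = late.reindex(all_days)

# Back-fill only the constant columns from the first populated row.
constant_cols = ["location", "source_name"]
padded[constant_cols] = padded[constant_cols].bfill()

# Numeric columns are deliberately left as NaN so the plot skips those days.
```

Leaving the numeric values as NaN avoids inventing vaccination counts for days before the entity reported anything.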

Trailing Rows

Create new time series "rows" for each missing trailing day. Parts of these rows will be incomplete.
Leave the missing values in the new rows as they are. We could extrapolate forward from previous data but did not do that here.
Forward fill any constants like reporting agency, country, or source web site. They tend to stay the same over time and do not affect the plot.
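One thing to watch for with trailing rows is that calling pandas `interpolate()` with its defaults carries the last known value forward into them. The sketch below shows both behaviors on made-up data with hypothetical names; passing `limit_area="inside"` is one way to leave the trailing values as NaN, and may not be how the repository code does it.

```python
import pandas as pd

# Hypothetical entity whose last report is Jan 2 while the plot runs to Jan 4.
lagging = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-01-01", "2021-01-02"]),
        "location": ["C1", "C1"],
        "total_vaccinations": [100.0, 200.0],
    }
).set_index("date")

# Assumed common end date shared by every entity in the plot.
plot_end = pd.Timestamp("2021-01-04")

# Create the missing trailing rows.
all_days = pd.date_range(lagging.index.min(), plot_end, freq="D")
extended = lagging.reindex(all_days)

# Forward fill constants such as the country name.
extended["location"] = extended["location"].ffill()

# Default interpolate() repeats the last valid value in the trailing rows.
carried = extended["total_vaccinations"].interpolate()

# limit_area="inside" fills only interior gaps, so trailing rows stay NaN.
kept_nan = extended["total_vaccinations"].interpolate(limit_area="inside")
```

Here `carried` ends with 200.0 repeated on Jan 3 and Jan 4, which would draw a misleading flat line, while `kept_nan` leaves those two days empty.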

Process Flow

The Python code operates in the following sequence.

Video Walkthroughs

Data Preparation - Fixing up the data for visualization.

Python Code Walkthrough GitHub

Comparing nation performance.

Comparing state performance.

Python Source on GitHub

Python source code is available on GitHub. The repository contains a shell script that downloads the latest data and then starts a docker container that mounts the scripts and data into a Jupyter Notebook.

Blog de Joe Freeman