Visualizing public data sets with Python and Elasticsearch

August 10, 2020

I needed a big data set so I could learn how to load and visualize data using Elasticsearch and Kibana. Sample public data, sample scripts, and dashboards are available in the elastic GitHub repository. The "donors choice" data set is 150MB compressed and 7-9GB when expanded and indexed into Elasticsearch. The repository is old, so I updated the donors choice configuration and script files. You can find the code, updated for Elasticsearch 7.8, in my GitHub:

https://github.com/freemansoft/examples/tree/master/Common%20Data%20Formats

6,200,000 Records over 7 GB indexed

The final Donors Choice data set flattens into 6.2 million records. The source data is contained in 3 compressed files, which are decompressed, massaged into a flattened (de-normalized) form, and loaded into Elasticsearch. The resulting index is between 7.5GB and 9GB. Our Kibana dashboard provides visual insight into that data.
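As a rough illustration of what "flattened (de-normalized)" means here, the sketch below joins three small in-memory tables into one record per donation. It is not the script from the repository. The table roles (projects, donations, resources), the field names, and the choice of one output record per donation are all assumptions made for the example.

```python
from collections import defaultdict


def flatten(projects, donations, resources):
    """Return one de-normalized record per donation.

    All field names here (project_id, item_name, ...) are hypothetical.
    """
    # Index the lookup tables by project id so each join is a dict lookup.
    project_by_id = {p["project_id"]: p for p in projects}
    items_by_project = defaultdict(list)
    for resource in resources:
        items_by_project[resource["project_id"]].append(resource["item_name"])

    flat = []
    for donation in donations:
        project_id = donation["project_id"]
        record = dict(donation)
        # Copy the parent project's fields onto the donation, prefixed to
        # avoid clashing with the donation's own field names.
        for key, value in project_by_id.get(project_id, {}).items():
            if key != "project_id":
                record["project_" + key] = value
        # Attach the project's requested items as a list field.
        record["resource_items"] = list(items_by_project.get(project_id, []))
        flat.append(record)
    return flat
```

Every flattened record repeats its project's fields. That repetition is part of why the index is so much larger than the compressed source files.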

Video Talk

This walkthrough briefly shows the steps for manipulating the data and then visualizing it using a Kibana dashboard. This is not a Kibana visualization guide; I just show a previously configured dashboard.

Based on

My work is based on the following:

Elasticsearch examples project containing Python scripts for transforming the data and ingesting it into Elasticsearch
https://github.com/elastic/examples/tree/master/Exploring%20Public%20Datasets/donorschoose
The DonorsChoose data set is available here
https://github.com/BeelGroup/Augmented-DonorsChoose.org-Dataset/releases
I used this site to help me set up my Anaconda / Jupyter notebook server
https://gist.github.com/kauffmanes/5e74916617f9993bc3479f401dfec7da

System Requirements

You will probably need a 32GB system to load and prep the data, judging from the size of my Windows 10 WSL VM. I ran this on a 64GB server I got off Craigslist. The actual index is much smaller, approximately 7.5GB.

Loading Data

This assumes you are running Linux, macOS, or Linux in a Windows WSL2 subsystem. All commands are run at a bash prompt.

Clone the repo or pull the directory from https://github.com/freemansoft/examples/tree/master/Exploring%20Public%20Datasets/donorschoose

Maybe someday this will get merged with the elastic examples.

Run Elasticsearch and Kibana in Docker if you don't already have them working.

Open a Linux terminal
cd into the Donors Choice directory. You should see docker-compose.yml
docker-compose up

Download the data, run the Python script, and index the data in Elasticsearch. This may take quite some time.

Open a Linux terminal
cd into the scripts directory
1-download-and-index.bash

Open a browser to the Kibana URL and verify the index

http://localhost:5601/
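The index can also be checked from a script by asking Elasticsearch for its document count. The sketch below is a minimal example under several assumptions: that Elasticsearch listens on its default port 9200 on localhost, that security is disabled, and that the index is named `donorschoose`. That index name is a guess for illustration, so substitute whatever name the load script actually created.

```python
import json
from urllib.request import urlopen

# Approximate record count quoted in this post.
EXPECTED_RECORDS = 6_200_000


def count_url(base_url, index):
    """Build the URL of the Elasticsearch _count endpoint for an index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def parse_count(body):
    """Extract the document count from a _count response body."""
    return int(json.loads(body)["count"])


def fetch_count(base_url="http://localhost:9200", index="donorschoose"):
    """Query a running cluster. Host, port, and index name are assumptions."""
    with urlopen(count_url(base_url, index), timeout=10) as response:
        return parse_count(response.read())
```

With the containers running, `fetch_count()` should return a number close to `EXPECTED_RECORDS` once the load has finished.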

Effect of Parallel processing on data prep time

These two pictures show the before and after times for the data preparation stages.

Timings for sections prior to parallel conversion. Excludes Elasticsearch index time.

Timings for sections after parallel conversion. Excludes Elasticsearch index time.

Implementing two parallel changes plus the Elasticsearch parallel load API knocked more than 105 minutes off the wall time for preparation and indexing on a 2GHz machine.
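To give a feel for the kind of change involved, here is a minimal sketch of parallel date manipulation: the rows are split into chunks and each chunk is processed in its own worker process. It is not the code from the repository. The field names, chunk size, and worker count are hypothetical, and the Elasticsearch indexing side is not shown.

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime


def add_date_fields(row):
    """Derive extra date fields from a timestamp (hypothetical field names)."""
    ts = datetime.fromisoformat(row["donation_timestamp"])
    enriched = dict(row)
    enriched["donation_year"] = ts.year
    enriched["donation_month"] = ts.month
    enriched["donation_weekday"] = ts.weekday()  # Monday is 0
    return enriched


def process_chunk(rows):
    """Work done by one worker process: enrich every row in the chunk."""
    return [add_date_fields(row) for row in rows]


def chunked(rows, size):
    """Split a list into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def prepare_parallel(rows, workers=4, chunk_size=50_000):
    """Fan the chunks out to a process pool and reassemble them in order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input chunks.
        results = pool.map(process_chunk, chunked(rows, chunk_size))
        return [row for chunk in results for row in chunk]
```

Processes rather than threads are used because this work is CPU-bound. When running it as a script, call `prepare_parallel` from under an `if __name__ == "__main__":` guard so worker processes can import the module safely.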

920sec --> 169sec Grouping
1450sec --> 380sec Date Manipulation
8223sec --> 3686sec Elasticsearch Indexing

6358 seconds saved, or more than 105 minutes (105 minutes 58 seconds)
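The total can be reproduced from the three before and after timings listed above. This short sketch uses only those numbers and also works out the per-stage speedup.

```python
# (before, after) wall-clock seconds for each stage, taken from the list above.
timings = {
    "Grouping": (920, 169),
    "Date Manipulation": (1450, 380),
    "Elasticsearch Indexing": (8223, 3686),
}

# Seconds saved per stage, and the overall total.
saved = {stage: before - after for stage, (before, after) in timings.items()}
total_saved = sum(saved.values())
minutes, seconds = divmod(total_saved, 60)

for stage, (before, after) in timings.items():
    print(f"{stage}: saved {saved[stage]} sec, {before / after:.1f}x faster")
print(f"Total: {total_saved} sec = {minutes} min {seconds} sec")
```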

Blogs in this set

Load previously generated Kibana Visualizations

See the related blog articles

OS Tuning including WSL

You should not need any tuning if you run a single-node Elasticsearch cluster. I found I did need to change the Linux memory configuration when I ran a 3-node Elasticsearch cluster.

sudo sysctl -w vm.max_map_count=262144

See https://github.com/freemansoft/docker-scripts/tree/main/elasticsearch for how this can be managed with WSL instances
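Before starting a multi-node cluster, it can be handy to check whether the kernel setting is already high enough. The sketch below reads the current value from `/proc/sys/vm/max_map_count` on Linux and compares it to the 262144 used in the `sysctl` command above. The helper names are made up for this example.

```python
from pathlib import Path

# Value used in the sysctl command above.
REQUIRED_MAX_MAP_COUNT = 262144


def parse_max_map_count(text):
    """Parse the contents of /proc/sys/vm/max_map_count."""
    return int(text.strip())


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current kernel setting (Linux only)."""
    return parse_max_map_count(Path(path).read_text())


def is_sufficient(current, required=REQUIRED_MAX_MAP_COUNT):
    """True when the current setting meets or exceeds the required value."""
    return current >= required
```

On a Linux or WSL2 host, `is_sufficient(read_max_map_count())` returning `False` means the `sysctl` command still needs to be run.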

Helpful Docker Commands

You can remove the Elasticsearch data volume with the following command. I did this when I wanted to restart from a zero data state.

docker-compose down
docker volume prune

Windows 10 WSL File System Hints

You can see files in the WSL Linux file system via this file share.

\\wsl$
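Under that share, each installed distribution appears as a top-level folder. The small sketch below turns a Linux path into the matching Windows path. The distribution name `Ubuntu` and the example directory are placeholders, not names taken from my setup.

```python
def wsl_unc_path(distro, linux_path):
    """Map a Linux path inside a WSL distribution to its \\\\wsl$ share path."""
    # Drop empty segments so leading or doubled slashes do not matter.
    parts = [part for part in linux_path.split("/") if part]
    return "\\\\wsl$\\" + "\\".join([distro] + parts)


# Hypothetical distribution name and directory, for illustration only.
print(wsl_unc_path("Ubuntu", "/home/joe/examples"))
```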

Revision History

Docker VM kernel parameter tuning 6/2023

Joe Freeman's Blog