Visualizing public data sets with Python and Elasticsearch

I needed a big data set so I could learn how to load and visualize data using Elasticsearch and Kibana. There are sample public data sets, scripts and dashboards available in the elastic GitHub repository. The "donors choice" data set is 150MB compressed and 7-9GB when expanded and indexed into Elasticsearch. The repository is old, so I updated the donors choice configuration and script files. You can find the code, updated for Elasticsearch 7.8, in my GitHub:

  • https://github.com/freemansoft/examples/tree/master/Common%20Data%20Formats

6,200,000 Records over 7 GB indexed

The final Donors Choice data set flattens into 6.2 million records. The source data is contained in 3 compressed files, which are decompressed, massaged into a flattened (de-normalized) form and loaded into Elasticsearch. The resulting index is somewhere between 7.5GB and 9GB. Our Kibana dashboard provides visualization insights into that data.
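
A minimal sketch of what "flattening" means here, assuming the three source files describe projects, donations and resources, and that one flat record is produced per donation. The field names (project_id, item_name and so on) and the flatten function are hypothetical illustrations, not taken from the actual scripts.

```python
from collections import defaultdict


def flatten(projects, donations, resources):
    """Return one flat dict per donation with project fields and resource names copied in."""
    # Index projects by id so each donation can look up its parent quickly.
    projects_by_id = {p["project_id"]: p for p in projects}

    # Group resource names by project so each flat record can carry them as a list.
    resources_by_project = defaultdict(list)
    for r in resources:
        resources_by_project[r["project_id"]].append(r["item_name"])

    flat = []
    for d in donations:
        project = projects_by_id.get(d["project_id"], {})
        record = dict(d)  # start from a copy of the donation row
        # Prefix project fields so they cannot collide with donation fields.
        for key, value in project.items():
            if key != "project_id":
                record[f"project_{key}"] = value
        record["resource_names"] = list(resources_by_project.get(d["project_id"], []))
        flat.append(record)
    return flat


# Tiny made-up rows, only to show the shape of the result (hypothetical fields).
projects = [{"project_id": "p1", "school_state": "NY"}]
donations = [
    {"donation_id": "d1", "project_id": "p1", "amount": 25.0},
    {"donation_id": "d2", "project_id": "p1", "amount": 10.0},
]
resources = [{"project_id": "p1", "item_name": "markers"}]

flat_records = flatten(projects, donations, resources)
```

Each flat record repeats the project data, which is why the de-normalized index ends up so much larger than the compressed source files.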

Video Talk

This walkthrough briefly shows the steps for manipulating the data and then visualizing it using a Kibana dashboard. This is not a Kibana visualization guide. I just show a previously configured dashboard.


Based on 

My work is based on the following

System Requirements

You will probably need a 32GB system to load and prep the data, judging from the size of my Windows 10 WSL VM. I ran this on a 64GB server I got off Craigslist. The actual index is much smaller, approximately 7.5GB.

Loading Data

This assumes you are running Linux, Mac, or Linux in a Windows WSL2 subsystem. All commands are bash prompt commands.
  1. Clone the repo or pull the directory from https://github.com/freemansoft/examples/tree/master/Exploring%20Public%20Datasets/donorschoose 
    1. Maybe someday this will get merged with the elastic examples
  2. Run Elasticsearch and Kibana in Docker if you don't already have them working.
    1. Open a linux terminal 
    2. cd into the Donors Choice directory. You should see docker-compose.yml
    3. docker-compose up
  3. Download the data, run the Python script and index the data into Elasticsearch. This may take quite some time.
    1. Open a linux terminal 
    2. cd into the scripts directory
    3. 1-download-and-index.bash
  4. Open a browser to the Kibana URL and verify the index.
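
Besides checking in Kibana, the document count can be read straight from Elasticsearch. This is a sketch using the standard _count endpoint and only the Python standard library. The index name "donorschoose" and the address http://localhost:9200 are assumptions; use whatever index name the script created and whatever port your docker-compose.yml exposes. The count in the sample response is illustrative, rounded from the figure quoted in this post.

```python
import json
import urllib.request


def count_url(index, host="http://localhost:9200"):
    """Build the URL of the Elasticsearch _count endpoint for an index."""
    return f"{host}/{index}/_count"


def parse_count(body):
    """Extract the document count from a _count response body (str or bytes)."""
    return json.loads(body)["count"]


def fetch_count(index, host="http://localhost:9200"):
    """Ask a running Elasticsearch node how many documents an index holds."""
    with urllib.request.urlopen(count_url(index, host)) as response:
        return parse_count(response.read())


# Shape of a _count response, parsed here without any network call.
sample_body = '{"count": 6200000, "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0}}'
sample_count = parse_count(sample_body)
```

With the containers running, fetch_count("donorschoose") should return a number in the neighborhood of 6.2 million once indexing has finished, assuming that index name is correct.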

Effect of Parallel processing on data prep time

These two pictures show the before and after times for the data preparation stages.

Timings for sections prior to parallel conversion. Excludes Elasticsearch index time.
Timings for sections after parallel conversion. Excludes Elasticsearch index time.
Implementing two parallel changes plus the Elasticsearch parallel load API knocked more than 105 minutes off the wall time for preparation and indexing on a 2GHz machine:
  • 920sec --> 169sec  Grouping
  • 1450sec --> 380sec Date Manipulation
  • 8223sec --> 3686sec Elasticsearch Indexing
6358 seconds saved in total, or just under 106 minutes.
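
The arithmetic behind that total can be checked with a few lines of Python. The timings are copied from the list above. The speedup factor (before divided by after) is an extra figure not quoted in the original measurements.

```python
# Stage timings in seconds as (before, after), copied from the list above.
timings = {
    "Grouping": (920, 169),
    "Date Manipulation": (1450, 380),
    "Elasticsearch Indexing": (8223, 3686),
}

# Seconds saved per stage and overall.
saved_per_stage = {name: before - after for name, (before, after) in timings.items()}
total_saved = sum(saved_per_stage.values())
minutes, seconds = divmod(total_saved, 60)

for name, (before, after) in timings.items():
    print(f"{name}: saved {before - after}s, {before / after:.1f}x faster")

print(f"Total saved: {total_saved}s = {minutes} min {seconds} s")
```

Indexing saves the most seconds in absolute terms, while Grouping shows the largest relative improvement.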

Blogs in this set

Load previously generated Kibana Visualizations

See the related blog articles

OS Tuning

You should not need any tuning if you run a single-node Elasticsearch cluster. I found I did need to change the Linux memory configuration when I ran a 3-node Elasticsearch cluster.
sudo sysctl -w vm.max_map_count=262144
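
To see whether the setting is already high enough before changing it, the current value can be read from /proc on a Linux host. This is a small sketch. The helper names is_sufficient and check_max_map_count are hypothetical, and the threshold is simply the value from the sysctl command above.

```python
from pathlib import Path

REQUIRED_MAX_MAP_COUNT = 262144  # same value as the sysctl command above


def is_sufficient(raw_value, required=REQUIRED_MAX_MAP_COUNT):
    """Compare the text read from the kernel setting against the required minimum."""
    return int(raw_value.strip()) >= required


def check_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the current setting on a Linux host and report whether it is high enough."""
    return is_sufficient(Path(path).read_text())
```

Calling check_max_map_count() inside the WSL or Linux shell returns True when no change is needed. Note that a value set with sysctl -w does not survive a reboot.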

Helpful Docker Commands

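You can remove the Elasticsearch data volume with the following commands. I did this when I wanted to restart from a zero data state. Be aware that docker volume prune is not limited to this project and can remove unused volumes belonging to other projects as well.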
  • docker-compose down
  • docker volume prune

Windows 10 WSL Hints

You can see files in the WSL Linux file system via this file share.
  • \\wsl$
