Visualizing public data sets with Python and Elasticsearch
I needed a large data set so I could learn how to load and visualize data with Elasticsearch and Kibana. The elastic GitHub repository provides sample public data sets along with sample scripts and dashboards. The DonorsChoose data set is 150 MB compressed and 7-9 GB once expanded and indexed into Elasticsearch. The repository is old, so I updated the DonorsChoose configuration and script files. You can find the code, updated for Elasticsearch 7.8, in my GitHub:
- https://github.com/freemansoft/examples/tree/master/Common%20Data%20Formats
6,200,000 records, over 7 GB indexed
The final DonorsChoose data set flattens into 6.2 million records. The source data arrives as three compressed files, which are decompressed, massaged into a flattened (de-normalized) form, and loaded into Elasticsearch. The resulting index is somewhere between 7.5 GB and 9 GB. Our Kibana dashboard provides visualization insights into that data.
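The flattening step can be sketched with pandas: repeat each project's columns onto every matching donation row so the index holds one fully self-contained document per donation. The file and column names here (`projectid`, the two CSVs) are illustrative assumptions, not the exact schema used by the repository's scripts.

```python
import pandas as pd

def flatten(projects_csv, donations_csv):
    """De-normalize donations by joining the project columns onto each row.

    Accepts paths or file-like objects, as pandas.read_csv does.
    """
    projects = pd.read_csv(projects_csv)
    donations = pd.read_csv(donations_csv)
    # Left-join donations onto projects: one output row per donation,
    # with the project columns duplicated (flattened) onto every row.
    return donations.merge(projects, on="projectid", how="left")
```

This is why the 150 MB compressed download balloons into a multi-gigabyte index: the de-normalized form trades storage for query-time simplicity.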
Video Talk
This walkthrough briefly shows the steps for manipulating the data and then visualizing it with a Kibana dashboard. It is not a Kibana visualization guide; I just show a previously configured dashboard.
Based on
- https://github.com/elastic/examples/tree/master/Exploring%20Public%20Datasets/donorschoose - the Elasticsearch examples project containing the Python scripts for transforming the data and ingesting it into Elasticsearch. The DonorsChoose data set is available here.
- https://github.com/BeelGroup/Augmented-DonorsChoose.org-Dataset/releases - the Augmented DonorsChoose.org Dataset releases.
- https://gist.github.com/kauffmanes/5e74916617f9993bc3479f401dfec7da - the guide I used to set up my Anaconda / Jupyter notebook server.
System Requirements
Loading Data
- Clone the repo or pull the directory from https://github.com/freemansoft/examples/tree/master/Exploring%20Public%20Datasets/donorschoose
- Maybe someday this will get merged with the elastic examples
- Run Elasticsearch and Kibana in Docker if you don't already have them running.
- Open a linux terminal
- cd into the donorschoose directory. You should see docker-compose.yml
- docker-compose up
- Download the data, run the Python script, and index into Elasticsearch. This may take quite some time.
- Open a linux terminal
- cd into the scripts directory
- Run 1-download-and-index.bash
- Open a browser to the Kibana URL and verify the index
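Under the hood, the indexing step feeds the flattened records to Elasticsearch's bulk API. A minimal sketch of that shape, assuming the Python `elasticsearch` client's `helpers.bulk` is used (the index name `donorschoose` is an assumption for illustration, not necessarily the name the scripts create):

```python
def to_bulk_actions(records, index="donorschoose"):
    """Yield one bulk-API action dict per flattened record.

    The resulting generator can be passed to
    elasticsearch.helpers.bulk(es_client, to_bulk_actions(records)),
    which streams the documents to the cluster in batches.
    """
    for record in records:
        yield {"_index": index, "_source": record}
```

Generating actions lazily like this keeps memory flat even with 6.2 million records, since the bulk helper consumes the generator in chunks rather than materializing the whole list.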
Effect of parallel processing on data prep time
| Phase | Before parallel conversion | After parallel conversion |
| --- | --- | --- |
| Grouping | 920 sec | 169 sec |
| Date manipulation | 1450 sec | 380 sec |
| Elasticsearch indexing | 8223 sec | 3686 sec |
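The speedups above came from moving the per-record transformations into a `multiprocessing.Pool`. A minimal sketch of the pattern, assuming the data can be split into independent chunks (the doubling transform is a stand-in for the real grouping and date-parsing work, not the scripts' actual logic):

```python
from multiprocessing import Pool

def transform(chunk):
    # Placeholder for the real per-chunk work (grouping, date parsing).
    return [x * 2 for x in chunk]

def parallel_apply(data, workers=4, chunks=8):
    """Split data into chunks, transform them in a worker pool, recombine."""
    size = max(1, len(data) // chunks)
    parts = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(processes=workers) as pool:
        results = pool.map(transform, parts)  # preserves chunk order
    # Flatten the per-chunk results back into one list.
    return [x for part in results for x in part]
```

Note that `pool.map` returns results in input order, so the recombined output matches what a single-process run would produce; the worker function must be defined at module top level so it can be pickled.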
Blogs in this set
- https://joe.blog.freemansoft.com/2020/08/visualizing-public-data-sets-with.html
- https://joe.blog.freemansoft.com/2020/08/visualizing-donors-choose-data-set-with.html
- http://joe.blog.freemansoft.com/2020/08/python-multiprocessingpool-improvement.html
Load previously generated Kibana Visualizations
OS Tuning including WSL
sudo sysctl -w vm.max_map_count=262144
See https://github.com/freemansoft/docker-scripts/tree/main/elasticsearch for how this can be managed with WSL instances
Helpful Docker Commands
- docker-compose down
- docker volume prune
Windows 10 WSL File System Hints
- \\wsl$