DASK so cool - Faster data analytics with Python and DASK

DASK so cool - Faster data analytics with Python and DASK - Run locally with Docker

January 25, 2020

Data scientists and others want to analyze and manipulate lots of data without having to be Software developers. Jupyter notebooks, Python and no DASK are vehicles that help make that happen.

DASK is a distributed Python library that can run in pretty much any Python environment. It was designed to help data scientists scale up their data analysis by providing an easy to use distributed compute API and paradigm. DASK/Python can be run both inside Jupyter notebooks and as standalone Python programs.

Environment

The DASK environment consists of a Dask Scheduler and any number of worker nodes. The Python program sends a set of tasks to the Dask Scheduler which then distributes those across the worker nodes. The worker node results are then aggregated and returned to the original program.

Docker

The DASK team provides samples how to use docker-compose to create a DASK development and execution environment The environment consists of

A development node with Python and Jupyter Notebook server
A DASK scheduler
DASK worker nodes

I have created a docker-compose.yml file that mounts persistent storage on the development node. This makes it easy to keep work across container deployments. My docker-compose.yml file is derived from the DASK team's examples which do not have persistent storage, at this time.

Local DASK docker cluster deployed in Docker

Quickstart DASK Python Code

See this docker-scripts repository on GitHub.

Clone the repository,
cd into dask
run docker-compose up.
Run this sample code

from dask.distributed import Client
client = Client('scheduler:8786')
def square(x):
    return x ** 2
def neg(x):
    return -x

# this will run two different distributed operations
A = client.map(square, range(100))
B = client.map(neg, A)

total = client.submit(sum, B)
total.result()

client.gather(A)

DASK in Action

The demonstration code for the video came from distributed.dask.org quickstart

DASK with Jupyter notebooks running locally on Docker

DASK with VSCode running locally on Docker

Blog de Joe Freeman