DASK so cool - Faster data analytics with Python and DASK - Run locally with Docker

Data scientists and others want to analyze and manipulate lots of data without having to be Software developers. Jupyter notebooks, Python and no DASK are vehicles that help make that happen.

DASK is a distributed Python library that can run in pretty much any Python environment.  It was designed to help data scientists scale up their data analysis by providing an easy to use distributed compute API and paradigm. DASK/Python can be run both inside Jupyter notebooks and as standalone Python programs.

Environment

The DASK environment consists of a Dask Scheduler and any number of worker nodes. The Python program sends a set of tasks to the Dask Scheduler which then distributes those across the worker nodes.  The worker node results are then aggregated  and returned to the original program.

Docker

The DASK team provides samples how to use docker-compose to create a DASK development and execution environment The environment consists of
  • A development node with Python and Jupyter Notebook server
  • A DASK scheduler
  • DASK worker nodes
I have created a docker-compose.yml file  that mounts persistent storage on the development node. This makes it easy to keep work across container deployments.  My docker-compose.yml file is derived from the DASK team's examples which do not have persistent storage, at this time.

Local DASK docker cluster deployed in Docker


















Quickstart DASK Python Code

See this docker-scripts repository on GitHub
  1. Clone the repository, 
  2. cd into dask
  3. run docker-compose up.
  4. Run this sample code

from dask.distributed import Client
client = Client('scheduler:8786')
def square(x):
    return x ** 2
def neg(x):
    return -x

# this will run two different distributed operations
A = client.map(square, range(100))
B = client.map(neg, A)
total = client.submit(sum, B)
total.result()
client.gather(A)

DASK in Action

The demonstration code for the video came from distributed.dask.org quickstart
DASK with Jupyter notebooks running locally on Docker


DASK with VSCode running locally on Docker


Comments

Popular posts from this blog

Installing the RNDIS driver on Windows 11 to use USB Raspberry Pi as network attached

Understanding your WSL2 RAM and swap - Changing the default 50%-25%

Almost PaaS Document Parsing with Tika and AWS Elastic Beanstalk