Streaming is Micro-Batching when feeding a Data Lake

Streaming data into Data Lakes is one of those ideas that doesn't work exactly the way you expect it to. Your pretty data streams are packed up and sent into your data lake by an old-school batch process.

The primary reason for this is the way data is stored in the cloud: file/blob storage is non-appendable. Once a data file is written, no more records can be appended to it, so new files/objects must continually be generated to sustain any kind of producer performance.

Streams are Micro-Batched

Data lake writers must batch up streaming data in order to write it to cloud storage, where a group of records is written as a single operation. Writing individual records is inefficient and may not even be possible given the arrival rate of the message stream. Files in the same dataset in the Data Lake may be different sizes depending on how the batch-write triggers are defined.
A streaming Lake Writer collects incoming messages and micro-batches them into blocks that are written to non-appendable cloud storage. Writes are commonly triggered by elapsed time or record count.
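A minimal sketch of such a writer in Python. The MicroBatchWriter class, its store client with a put(key, data) method, and the trigger thresholds are illustrative assumptions, not any particular library's API:

    import time
    import uuid


    class MicroBatchWriter:
        """Buffers records and writes each batch as a brand-new object,
        because cloud blobs cannot be appended to."""

        def __init__(self, store, max_records=5000, max_wait_seconds=60):
            self.store = store                        # assumed client with put(key, data)
            self.max_records = max_records            # count trigger
            self.max_wait_seconds = max_wait_seconds  # time trigger
            self.buffer = []
            self.last_flush = time.monotonic()

        def write(self, record: bytes) -> None:
            self.buffer.append(record)
            if self._should_flush():
                self.flush()

        def _should_flush(self) -> bool:
            too_many = len(self.buffer) >= self.max_records
            too_old = time.monotonic() - self.last_flush >= self.max_wait_seconds
            return too_many or (too_old and len(self.buffer) > 0)

        def flush(self) -> None:
            if not self.buffer:
                return
            # Each micro-batch becomes a new immutable object; a unique
            # key per batch avoids ever needing to append or overwrite.
            key = f"dataset/batch-{uuid.uuid4()}.jsonl"
            self.store.put(key, b"\n".join(self.buffer))
            self.buffer.clear()
            self.last_flush = time.monotonic()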

Multi-Channel Streams and Micro-Batches

Streaming systems increase performance through partitioning and sharding, achieving virtually linear scaling in the number of messages handled, overall throughput, and the amount of data that can be stored.
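To make that concrete, a producer typically routes each message to a shard by hashing a partition key. The shard count and hash choice below are assumptions for illustration:

    import hashlib

    def shard_for(key: str, num_shards: int = 8) -> int:
        # A stable hash sends all messages for a key to the same shard,
        # so shards can be consumed and written independently in parallel.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_shards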

Shard Writers

Here, two streams feed the same Lake table. The data from each stream partition is independently written into cloud storage, which may result in more files, depending on volume. This is transparent to consumers.
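Building on the MicroBatchWriter sketch above, a shard-writer layout simply gives each partition its own writer (the store client and shard count are still assumed):

    # One independent writer per stream partition; each flushes on its
    # own schedule, so the table accumulates more, smaller files.
    writers = {shard: MicroBatchWriter(store) for shard in range(8)}

    def on_message(shard: int, record: bytes) -> None:
        writers[shard].write(record)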

Aggregated Writers

Here again, two streams feed the same Lake table, but data from the stream partitions may be intermixed when written into cloud storage. This may result in fewer, larger files as the streams are aggregated.
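In the same hypothetical setup, an aggregated layout funnels every partition into one shared writer instead:

    # A single writer shared by all partitions; records from different
    # streams intermix in one buffer, so batches fill faster and the
    # table accumulates fewer, larger files.
    aggregated = MicroBatchWriter(store)

    def on_message(shard: int, record: bytes) -> None:
        aggregated.write(record)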

Latency vs Batch Size

There is a design trade-off between file efficiency/size and the latency before data becomes visible to consumers. The micro-batch writer must decide when to write each micro-batch to the blob store based on the time elapsed since the last batch PUT and the number of records collected since then.
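A rough back-of-the-envelope, with made-up numbers, shows how the two triggers interact:

    def worst_case_visibility_seconds(rate_per_sec: float,
                                      max_records: int,
                                      max_wait_seconds: float) -> float:
        # A record arriving just after a flush waits until the next trigger:
        # either the buffer fills (count) or the timer expires (time).
        seconds_to_fill = max_records / rate_per_sec
        return min(seconds_to_fill, max_wait_seconds)

    # At 100 records/sec, a 10,000-record batch takes 100 seconds to fill,
    # so a 60-second time trigger caps latency at the cost of smaller files.
    print(worst_case_visibility_seconds(100, 10_000, 60))  # -> 60.0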

Created 7/2020
