Streaming is Micro-Batching when feeding a Data Lake
Streaming data into a Data Lake is one of those ideas that doesn't work exactly
as you expect it to. Your pretty data streams are packed up and sent into your data lake by an
old-school batch process.
The primary reason for this is the way data is stored in the cloud: as
non-append-able file/blob storage.
Existing data files cannot have records appended to them, so new files/objects
must be generated continually to sustain any kind of producer performance.
Streams are Micro Batched
Data lake writers must batch up streaming data and write each group of records
to cloud storage as a single operation. Writing individual records is
inefficient and may not even be possible, depending on the run rate of the
message stream. Files in the same Data Lake dataset may end up different
sizes based on the batch-writing trigger definitions.
A streaming Lake Writer collects incoming messages and micro-batches them into blocks that are written to non-append-able cloud storage. Writes are commonly triggered based on time or record count.
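To make the batching concrete, here is a minimal sketch of such a writer that flushes on either a record-count or an elapsed-time trigger. The `put_object` callable, the thresholds, and the file-naming scheme are illustrative assumptions, not any specific product's API.

```python
import json
import time
import uuid

class MicroBatchLakeWriter:
    """Buffers streamed records and writes each batch as one new object.

    `put_object` stands in for any cloud blob API (S3, GCS, ADLS);
    the thresholds and naming scheme are illustrative defaults.
    """

    def __init__(self, put_object, max_records=5000, max_age_seconds=60):
        self.put_object = put_object            # callable(key, bytes)
        self.max_records = max_records          # flush when this many records are buffered
        self.max_age_seconds = max_age_seconds  # ...or when the batch is this old
        self.buffer = []
        self.batch_started = time.monotonic()

    def handle(self, record):
        """Called once per incoming stream message."""
        self.buffer.append(record)
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        too_many = len(self.buffer) >= self.max_records
        too_old = time.monotonic() - self.batch_started >= self.max_age_seconds
        return too_many or too_old

    def flush(self):
        if not self.buffer:
            return
        # One PUT per micro-batch: the object is written whole, never appended to.
        body = "\n".join(json.dumps(r) for r in self.buffer).encode("utf-8")
        key = f"dataset/ingest_date={time.strftime('%Y-%m-%d')}/part-{uuid.uuid4()}.json"
        self.put_object(key, body)
        self.buffer = []
        self.batch_started = time.monotonic()
```

Each flush produces one new, immutable object; larger thresholds yield fewer, larger files at the cost of higher visibility latency.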
Multi-Channel Streams and Micro Batches
Streaming systems increase performance through partitioning and
sharding. They scale almost linearly in the number of messages handled,
in throughput, and in the amount of data that can be stored.
Diagram: Shard Writers and Aggregated Writers
This diagram shows two streams that are fed into the same Lake table. The data from the stream partitions may be intermixed when written into cloud storage. This may result in fewer, larger files as the streams are aggregated.
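A rough sketch of the aggregated-writer pattern, assuming the `MicroBatchLakeWriter` from the previous sketch and one in-memory queue per shard (both illustrative, not a specific framework's API):

```python
import queue

def aggregate_shards(shard_queues, writer, poll_timeout=0.5):
    """Drain several per-shard queues into one micro-batch writer.

    `shard_queues` is a list of queue.Queue objects, one per stream
    partition/shard; `writer` is a MicroBatchLakeWriter as sketched above.
    Records from different shards end up intermixed in the same output
    files, producing fewer, larger objects than one writer per shard would.
    Runs until the process is stopped.
    """
    while True:
        drained_any = False
        for q in shard_queues:
            try:
                record = q.get(timeout=poll_timeout)
            except queue.Empty:
                continue
            writer.handle(record)   # same batching/trigger logic as a single stream
            drained_any = True
        if not drained_any:
            writer.flush()          # idle: flush whatever is buffered so it becomes visible
```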
Latency vs Batch Size
There is a design trade-off between file efficiency/size and the
latency before data becomes visible to consumers. The
micro-batch writer typically decides when to write the next micro-batch to
blob storage based on the amount of time since the last batch PUT and the
number of streamed items accumulated.
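As an illustration of where this trade-off surfaces in practice, here is how the trigger interval is set in Spark Structured Streaming, one common lake writer (not named in this post); the broker, topic, paths, and 60-second interval are placeholder values.

```python
# Illustration only: Spark Structured Streaming as one example of a lake writer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-writer").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
)

(
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-lake/events/")                          # placeholder lake path
    .option("checkpointLocation", "s3://my-lake/_checkpoints/events/")
    .trigger(processingTime="60 seconds")  # longer interval: fewer, larger files, higher latency
    .start()
)
```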
Created 7/2020