Streaming is Micro-Batching when feeding a Data Lake
Streaming data into a Data Lake is one of those ideas that doesn't work exactly
as you expect it to. Your pretty data streams are packed up and sent into your data lake by an
old-school batch process.
The primary reason for this is the way data is stored in the cloud: as
non-append-able file/blob storage.
Existing data files cannot have records appended to them, so new files/objects
must be generated continually to sustain any kind of producer performance.
Streams are Micro Batched
Data lake writers must batch up streaming data and write each group of records
to cloud storage as a single operation. Writing individual records is
inefficient and may not even be possible, depending on the run rate of the
message stream. Files in the same Data Lake dataset may end up different
sizes based on the batch-writing trigger definitions.
A streaming Lake Writer collects incoming messages and micro-batches them into blocks that are written to non-append-able cloud storage. Writes are commonly triggered based on time or record count.
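To make the batching concrete, here is a minimal sketch of such a writer that flushes on either a record-count or an elapsed-time trigger. The `put_object` callable, the thresholds, and the file-naming scheme are illustrative assumptions, not any specific product's API.

```python
import json
import time
import uuid

class MicroBatchLakeWriter:
    """Buffers streamed records and writes each batch as one new object.

    `put_object` stands in for any cloud blob API (S3, GCS, ADLS);
    the thresholds and naming scheme are illustrative defaults.
    """

    def __init__(self, put_object, max_records=5000, max_age_seconds=60):
        self.put_object = put_object            # callable(key, bytes)
        self.max_records = max_records          # flush when this many records are buffered
        self.max_age_seconds = max_age_seconds  # ...or when the batch is this old
        self.buffer = []
        self.batch_started = time.monotonic()

    def handle(self, record):
        """Called once per incoming stream message."""
        self.buffer.append(record)
        if self._should_flush():
            self.flush()

    def _should_flush(self):
        too_many = len(self.buffer) >= self.max_records
        too_old = time.monotonic() - self.batch_started >= self.max_age_seconds
        return too_many or too_old

    def flush(self):
        if not self.buffer:
            return
        # One PUT per micro-batch: the object is written whole, never appended to.
        body = "\n".join(json.dumps(r) for r in self.buffer).encode("utf-8")
        key = f"dataset/ingest_date={time.strftime('%Y-%m-%d')}/part-{uuid.uuid4()}.json"
        self.put_object(key, body)
        self.buffer = []
        self.batch_started = time.monotonic()
```

Each flush produces one new, immutable object; larger thresholds yield fewer, larger files at the cost of higher visibility latency.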
Multi-Channel Streams and Micro Batches
Streaming systems increase performance through partitioning and
sharding. They scale almost linearly in the number of messages handled,
in throughput, and in the amount of data that can be stored.
Diagram: Shard Writers and Aggregated Writers
This diagram shows two streams that are fed into the same Lake table. The data from the stream partitions may be intermixed when written into cloud storage. This may result in fewer, larger files as the streams are aggregated.
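A rough sketch of the aggregated-writer pattern, assuming the `MicroBatchLakeWriter` from the previous sketch and one in-memory queue per shard (both illustrative, not a specific framework's API):

```python
import queue

def aggregate_shards(shard_queues, writer, poll_timeout=0.5):
    """Drain several per-shard queues into one micro-batch writer.

    `shard_queues` is a list of queue.Queue objects, one per stream
    partition/shard; `writer` is a MicroBatchLakeWriter as sketched above.
    Records from different shards end up intermixed in the same output
    files, producing fewer, larger objects than one writer per shard would.
    Runs until the process is stopped.
    """
    while True:
        drained_any = False
        for q in shard_queues:
            try:
                record = q.get(timeout=poll_timeout)
            except queue.Empty:
                continue
            writer.handle(record)   # same batching/trigger logic as a single stream
            drained_any = True
        if not drained_any:
            writer.flush()          # idle: flush whatever is buffered so it becomes visible
```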
Latency vs Batch Size
There is a design trade-off between file efficiency/size and the
latency before data becomes visible to consumers. The
micro-batch writer typically decides when to write the next micro-batch to
blob storage based on the amount of time since the last batch PUT and the
number of streamed items accumulated.
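As an illustration of where this trade-off surfaces in practice, here is how the trigger interval is set in Spark Structured Streaming, one common lake writer (not named in this post); the broker, topic, paths, and 60-second interval are placeholder values.

```python
# Illustration only: Spark Structured Streaming as one example of a lake writer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-writer").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "events")                      # placeholder topic
    .load()
)

(
    events.writeStream
    .format("parquet")
    .option("path", "s3://my-lake/events/")                          # placeholder lake path
    .option("checkpointLocation", "s3://my-lake/_checkpoints/events/")
    .trigger(processingTime="60 seconds")  # longer interval: fewer, larger files, higher latency
    .start()
)
```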
Created 7/2020