Streaming Data Concerns for the Unwary

There are other issues that may drive a project back to old-school, batch-style ingestion. This often happens because of the batch-oriented nature of data producers, the need to store data in exactly the same format, or the need to store data in exactly the same order as it was presented to the lake.

Streaming Data 

Streaming sends individual documents or records to interested parties. Those streaming listeners then write the data to their appropriate storage. Messages are often limited in size by restrictions in the streaming system and its storage.
The goal is to use our common data streams to update various consumers and populate our Data Lake and Data Warehouse. This diagram shows three consumers, one of which is used to write data to the Data Lake.
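As a minimal sketch of this fan-out pattern, the in-memory `Stream` class below is a hypothetical stand-in for a real streaming system: every published record is delivered to every subscribed consumer, one of which populates the lake and another the warehouse.

```python
# Hypothetical in-memory stand-in for a data stream. Each published
# record is delivered to every subscribed consumer, mirroring the
# fan-out in the diagram (lake writer, warehouse writer, etc.).
class Stream:
    def __init__(self):
        self.consumers = []

    def subscribe(self, handler):
        self.consumers.append(handler)

    def publish(self, record):
        for handler in self.consumers:
            handler(record)

lake, warehouse = [], []
stream = Stream()
stream.subscribe(lake.append)       # consumer that writes to the Data Lake
stream.subscribe(warehouse.append)  # consumer that feeds the Data Warehouse

stream.publish({"id": 1, "value": "a"})
```

In a real deployment each consumer would be an independent process with its own position in the stream; the point here is only that one published record reaches all interested parties.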

Uniqueness Not Guaranteed

Messages/data may be presented to the consumers or to the lake multiple times. This can happen because of failover events in the streaming system, consumers that die in the middle of processing, or failed batch read operations. Consumers and data sinks must be able to tolerate this behavior.
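One common way to live with duplicate delivery is to make the sink idempotent. This sketch (names are illustrative, not from the source) remembers the ids of messages it has already written and silently drops redeliveries; real systems would persist the seen set, or use an upsert keyed on the message id.

```python
# A minimal sketch of an idempotent sink: duplicates are detected by
# message id and dropped, so processing the same message twice has the
# same effect as processing it once.
class IdempotentSink:
    def __init__(self):
        self.seen = set()
        self.rows = []

    def write(self, message):
        if message["id"] in self.seen:
            return  # duplicate delivery; safe to ignore
        self.seen.add(message["id"])
        self.rows.append(message)

sink = IdempotentSink()
# id 1 is delivered twice (e.g., after a consumer failover) but stored once.
for msg in [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "a"}]:
    sink.write(msg)
```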

High Volume Streaming with Shards

High-volume systems increase throughput by increasing the parallelism of the system: we add data channels, consumers, or both. As with all solutions, this creates problems of its own.
This streaming system is sharded to increase throughput.  Here, roughly 1/3 of the data ends up in each shard.  Each shard can be consumed independently of the other shards. Note: We're using Kafka nomenclature here. 
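Sharding is typically driven by a stable hash of a record key, so that all records sharing a key land in the same shard, much like Kafka's key-to-partition mapping. A minimal sketch, with the function name and three-shard count as assumptions:

```python
import hashlib

def shard_for(key, num_shards=3):
    # A stable hash of the key picks the shard, so records with the
    # same key always land in the same shard. (Kafka uses a similar
    # key -> partition mapping, though with a different hash.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

shards = {0: [], 1: [], 2: []}
for key in ["order-17", "order-42", "order-17", "user-9"]:
    shards[shard_for(key)].append(key)
```

Because the mapping depends only on the key, each shard can be consumed independently, which is what makes the parallelism possible.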

Order Not Guaranteed

Without specific additional processing, data is not guaranteed to arrive in the consuming systems in the same order as it was sent.  Order may be guaranteed within a single stream, but it is not guaranteed across streams.  Stream consumers may run at different speeds, or fail in ways that cause data to be reprocessed and appear out of order.
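When order matters to a consumer, one additional-processing option is to have producers attach sequence numbers and have the consumer buffer out-of-order arrivals until the gap is filled. This is a sketch under that assumption, not a feature of any particular streaming system:

```python
import heapq

# Sketch of a consumer that restores order using producer-assigned
# sequence numbers: early arrivals wait in a min-heap and are released
# only once every lower-numbered message has been emitted.
class Reorderer:
    def __init__(self):
        self.next_seq = 0   # next sequence number we are allowed to emit
        self.pending = []   # heap of (seq, payload) arrived early
        self.output = []

    def receive(self, seq, payload):
        heapq.heappush(self.pending, (seq, payload))
        while self.pending and self.pending[0][0] == self.next_seq:
            self.output.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1

r = Reorderer()
# Messages arrive out of order (e.g., from consumers running at
# different speeds) but are emitted in sequence order.
for seq, payload in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    r.receive(seq, payload)
```

The trade-off is latency and memory: a lost or long-delayed message stalls everything behind it, which is why many pipelines instead design sinks that tolerate out-of-order data.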
