Streaming Data Concerns for the Unwary
There are other issues that may drive a project back to old-school batch-style ingestion. This often happens because of the batch-oriented nature of data producers, the need to store data in exactly the same format it was presented in, or the need to store data in exactly the same order as it was presented to the lake.
Streaming Data
Streaming sends individual documents or records to interested parties. Those streaming listeners then write the data to their appropriate storage. Messages are often limited in size due to restrictions in the streaming system and its underlying storage.
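Because of those size limits, producers typically have to check payload size before sending. A minimal sketch of that guard, assuming a 1 MB cap (Kafka's default `max.message.bytes` is in that neighborhood) and using a plain list as a stand-in for a real producer:

```python
import json

# Assumed cap for illustration; real limits come from the streaming system's config.
MAX_MESSAGE_BYTES = 1_048_576

def publish(stream, record):
    """Serialize a record and send it only if it fits the stream's size limit."""
    payload = json.dumps(record).encode("utf-8")
    if len(payload) > MAX_MESSAGE_BYTES:
        # Oversized records must be rejected, split, or written to a side channel.
        raise ValueError(
            f"message of {len(payload)} bytes exceeds limit of {MAX_MESSAGE_BYTES}"
        )
    stream.append(payload)  # stand-in for a real producer send

stream = []
publish(stream, {"id": 1, "event": "login"})
```

Oversized records usually end up split across messages or parked in object storage with only a pointer sent through the stream.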
Uniqueness Not Guaranteed
Messages/data may be presented to the consumers or to the lake multiple times. This can happen because of failover events in the streaming system, consumers that die in the middle of a process or failed batch read operations. Consumers and data sinks must be able to live with this behavior.
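One common way to live with duplicate delivery is an idempotent consumer that tracks which message IDs it has already processed. A sketch, assuming each message carries a unique `id` field (an assumption about the producer, not something the stream guarantees):

```python
def consume(messages, seen_ids, sink):
    """Idempotent consumer: skip messages whose IDs were already processed."""
    for msg in messages:
        if msg["id"] in seen_ids:
            continue  # duplicate delivery from a failover or retry; safe to drop
        seen_ids.add(msg["id"])
        sink.append(msg)

sink, seen = [], set()
# The same message delivered twice, e.g. after a consumer dies mid-process.
consume([{"id": "a", "v": 1}, {"id": "a", "v": 1}, {"id": "b", "v": 2}], seen, sink)
# sink now holds each logical message exactly once
```

In production the `seen_ids` set would live in durable storage (or the sink itself would be an upsert keyed on the message ID), since an in-memory set dies with the consumer.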
High Volume Streaming with Shards
High-volume systems increase throughput by increasing the parallelism of the system: we increase the number of data channels or the number of consumers. As with all solutions, this creates problems of its own.
Figure: This streaming system is sharded to increase throughput. Here, 1/3 of the data ends up in each shard. Each shard can be consumed independently of the other shards. Note: we're using Kafka nomenclature here.
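The usual way data is routed to a shard is by hashing a record key, so that all records with the same key land in the same shard. A sketch under that assumption, with three shards to match the figure:

```python
import hashlib

NUM_SHARDS = 3  # matches the three-shard example above

def shard_for(key: str) -> int:
    """Stable shard assignment: hash the record key, mod the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All records with the same key always land in the same shard, so
# per-key order is preserved within that shard even though the
# shards themselves are consumed independently.
```

This is why per-key ordering can survive sharding while global ordering cannot: two different keys may hash to different shards whose consumers run at different speeds.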
Order Not Guaranteed
Without specific additional processing, there is no guarantee that data will arrive in the consuming systems in the same order it was sent. Order may be guaranteed within a single stream, but it is not guaranteed across streams. Stream consumers may run at different speeds, or fail in ways that cause data to be reprocessed and appear out of order.
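When order matters, one form that "specific additional processing" can take is having the producer stamp each message with a monotonically increasing sequence number, which consumers use to restore the send order. A minimal sketch, assuming a producer-supplied `seq` field (this field is an assumption, not something the streaming system provides):

```python
def reorder(messages):
    """Restore producer order using a per-message sequence number.

    Assumes the producer stamps each message with a monotonically
    increasing 'seq' field before sending.
    """
    return sorted(messages, key=lambda m: m["seq"])

# Messages arrived out of order, e.g. from consumers running at different speeds.
arrived = [{"seq": 2, "v": "b"}, {"seq": 0, "v": "a"}, {"seq": 1, "v": "c"}]
ordered = reorder(arrived)  # back in the order the producer sent them
```

A real implementation would reorder within a bounded buffer or watermark window rather than sorting a complete batch, since a stream never "finishes" the way a list does.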