
Showing posts with the label Streaming

Event the heck out of it so that you can drive insights and keep options open

Business and technical events are an early, easy way to capture activity, notify other systems of activity, record technical changes, and log executed business functions. The information needed for those events tends to reside in specific steps of an execution flow, which means we often need to insert event-generation probes in multiple places and at multiple levels. Product owners need visibility into the business functions or services that are performed. Technical teams need visibility into the detailed activities executed in a system. Partner business functions and data stores need a way to reconstruct the data as it existed at specific times. Databases often provide only current or point-in-time state. Logs have PII restrictions. Metrics are statistical by nature.
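The multi-level probe idea can be sketched as follows. This is a minimal illustration, not the post's actual design: the `emit_event` helper, the event envelope fields, and the in-memory sink standing in for a real event bus (Kafka, SNS, EventBridge, etc.) are all assumptions.

```python
import json
import time
import uuid

# Hypothetical in-memory sink standing in for a real event bus.
EVENT_SINK = []

def emit_event(level, name, payload):
    """Publish a business- or technical-level event with a stable envelope."""
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "level": level,          # "business" or "technical"
        "name": name,
        "payload": payload,
    }
    EVENT_SINK.append(json.dumps(event))
    return event

def transfer_funds(source, target, amount):
    # Business-level probe: the function that was performed,
    # visible to product owners.
    emit_event("business", "funds.transfer.requested",
               {"source": source, "target": target, "amount": amount})
    # ... actual transfer logic would run here ...
    # Technical-level probe: the detailed step that executed,
    # visible to technical teams.
    emit_event("technical", "ledger.rows.updated",
               {"rows": 2, "source": source, "target": target})

transfer_funds("acct-1", "acct-2", 100)
print(len(EVENT_SINK))  # prints 2
```

Note that one business operation produced events at two levels, which is exactly why probes end up scattered across the execution flow.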

Isolating Historical Data and Breaking Changes

Teams often run into situations where a data set broke its compatibility at some point in time. This often happens when you have historical data that came from a previous system. We want the ability to combine that data in a way that requires consumers to understand as little of the difference as possible. The differences between historical and active data are essentially a major-version, breaking change to the data. The two major versions of the data can be isolated in their own raw storage areas and then merged together in one of our consumer-driven zones. We can continue to support minor-version producer schema changes as they occur in one of the raw streams. Those changes are then handled in the transformation tier on the way into the conformed zone. We register and link the three data sets in our Data Governance Catalog. This lets us capture the data models while enforcing data change and compatibility rules. Disciplined organiz...
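The merge into the consumer-driven zone can be sketched roughly like this. The field names and the v1-to-v2 mapping are illustrative assumptions, not the post's actual schemas; the point is that the breaking change is absorbed once, in the transformation tier, instead of in every consumer.

```python
# Minimal sketch: two raw zones with incompatible major versions are
# merged into one conformed, v2-shaped dataset.

def upgrade_v1(record):
    """Map a historical (v1) record onto the active (v2) shape.
    The renamed field and unit change are hypothetical examples of
    a breaking difference."""
    return {
        "customer_id": record["cust_no"],             # field renamed in v2
        "amount_cents": int(record["amount"] * 100),  # dollars -> cents in v2
        "source_version": 1,
    }

def conform(v1_raw, v2_raw):
    """Merge both raw zones; consumers see a single v2-shaped dataset."""
    conformed = [upgrade_v1(r) for r in v1_raw]
    conformed += [{**r, "source_version": 2} for r in v2_raw]
    return conformed

historical = [{"cust_no": "C1", "amount": 9.99}]
active = [{"customer_id": "C2", "amount_cents": 500}]
print(conform(historical, active))
```

Minor-version producer changes would be handled the same way: extend `upgrade_v1` (or a v2 counterpart) rather than exposing the variation downstream.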

Streaming Ecosystems Still Need Extract and Load

Enterprises move from batch to streaming data ingestion in order to make data available in a more near-real-time manner. This does not remove the need for extract and load capabilities. Streaming systems only operate on data that is in the stream right now. There is no data available from outside the retention window or from before the system was implemented. There is a whole other set of lifecycle operations that require some type of bulk operation. Examples include:

- Initial data loads where data was collected prior to, or outside of, stream processing.
- Original event streams that need to be re-ingested because they were mis-processed or because you wish to extract the data differently.
- Original event streams that are fixed or modified and re-ingested in order to correct errors or add information in the operational store.
- Privacy and retention rules that may require the generation of synthetic ev...
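A bulk re-ingestion of the kind listed above can be sketched as a replay job. This is an assumed shape, not a specific product's API: archived events are stored one JSON object per line, and `publish` is a hypothetical stand-in for a real producer client (Kafka, Kinesis, etc.).

```python
import io
import json

# Collected output of the hypothetical publisher, for illustration only.
REPLAYED = []

def publish(topic, event):
    REPLAYED.append((topic, event))

def replay_archive(lines, topic):
    """Bulk re-ingest archived events, preserving their original order and
    timestamps rather than stamping them with 'now'."""
    count = 0
    for line in lines:
        event = json.loads(line)
        event["replayed"] = True   # mark so consumers can distinguish replays
        publish(topic, event)
        count += 1
    return count

archive = io.StringIO('{"id": 1, "ts": 100}\n{"id": 2, "ts": 101}\n')
print(replay_archive(archive, "orders.v1"))  # prints 2
```

The same skeleton covers initial loads, corrected re-ingestion, and synthetic-event generation; only the source of `lines` and the per-event fix-up step change.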

Streaming Can Make Consumption Complicated

Streaming data into a lake is a powerful, modern approach that replaces traditional ETL. There are use cases where streaming can make things difficult for business systems or direct data users. Your data ingestion tier may have to support both data streaming and bulk processing.

Streaming Data Concerns for the Unwary

There are other issues that may drive a project toward total, old-school, batch-style ingestion. This often happens due to the batch-oriented nature of data producers, the need to store data in exactly the same format, or the need to store data in exactly the same order as it was presented to the lake.

Streaming Data
Streaming sends individual documents or records to interested parties. Those streaming listeners then write the data to their appropriate storage. Messages are often limited in size due to streaming and streaming-storage restrictions. The goal is to use our common data streams to update various consumers and populate our Data Lake and Data Warehouse. This diagram shows three consumers, one of which is used to write data to the Data Lake.

Uniqueness Not Guaranteed
Messages/data may be presented to the consumers or to the lake multiple times. This can happen because of failover events in the streaming system, consumers ...
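The duplicate-delivery problem is usually handled with an idempotent consumer. A minimal sketch, assuming each message carries a stable `event_id`: real systems would keep the seen-ID set in a durable store with a TTL, not an in-process `set`.

```python
# Minimal at-least-once deduplication sketch (idempotent consumer).
seen_ids = set()
lake = []

def consume(message):
    """Write each unique message to the lake exactly once."""
    if message["event_id"] in seen_ids:
        return False                      # duplicate delivery, skip it
    seen_ids.add(message["event_id"])
    lake.append(message)
    return True

# The same event delivered twice, e.g. after a broker failover.
consume({"event_id": "e1", "body": "order placed"})
consume({"event_id": "e1", "body": "order placed"})
print(len(lake))  # prints 1
```

Each of the three consumers in the diagram would need its own dedup state, since delivery guarantees apply per consumer, not per stream.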

Streaming is Micro-Batching when feeding a Data Lake

Streaming data into data lakes is one of those ideas that doesn't work exactly as you expect it to. Your pretty data streams are packed up and sent into your data lake by an old-school batch process. The primary reason for this is the way data is stored in the cloud, in non-appendable file/blob storage. Data files cannot have data appended to them; new files/objects must continually be generated for any kind of producer performance.

Streams are Micro-Batched
Groups of records are written as a single operation to cloud storage. Data lake writers must batch up streaming data in order to write it to cloud storage. Writing individual records is inefficient and may not even be possible depending on the run rate of the message stream. Files in the same dataset in the data lake may be different sizes based on the batch-writing trigger definitions. ...
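The batch-writing trigger logic can be sketched like this. The trigger values and `write_object`, a hypothetical stand-in for a cloud blob PUT (S3, GCS, Azure Blob), are assumptions for illustration.

```python
import time

written_objects = []

def write_object(records):
    """Hypothetical blob PUT: each call produces one non-appendable file."""
    written_objects.append(list(records))

class MicroBatcher:
    """Buffer streaming records; flush one object per count or age trigger."""

    def __init__(self, max_records=3, max_age_seconds=60.0):
        self.max_records = max_records
        self.max_age = max_age_seconds
        self.buffer = []
        self.opened_at = None

    def add(self, record):
        if not self.buffer:
            self.opened_at = time.monotonic()
        self.buffer.append(record)
        # Count trigger or age trigger closes out the current file.
        if (len(self.buffer) >= self.max_records or
                time.monotonic() - self.opened_at >= self.max_age):
            self.flush()

    def flush(self):
        if self.buffer:
            write_object(self.buffer)
            self.buffer = []

batcher = MicroBatcher(max_records=3)
for i in range(7):
    batcher.add({"id": i})
batcher.flush()  # close out the partial final batch
print([len(obj) for obj in written_objects])  # prints [3, 3, 1]
```

The uneven final object shows why files in the same dataset end up different sizes: whichever trigger fires first decides where a file boundary falls.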

Protect messaging and streaming data in the cloud with "data key" encryption

The best approach for protecting data in message queues and data streams is to not put any sensitive data in the message. Some systems use a claim-check model, where the messages contain just resource identifiers that can be passed back to the originating system to retrieve the data. The claim-check approach creates tighter coupling between the producer and consumers. It puts an additional burden on the producer, which must be able to cough up the data associated with the claim for some period of time. Some systems have to create caching architectures to store the claims for retrieval, adding additional complexity to the producer. Data/payload encryption is an alternative approach that can be used to protect data stored in messaging systems or on disk. Sensitive data is encrypted and put into the message payload. Producers and consumers only need to share access to encryption and decryption keys. This is easy in cloud environments which have services b...
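The data-key pattern can be sketched as follows. To keep the example standard-library-only, the XOR "cipher" below is a toy stand-in for a real algorithm such as AES-GCM and must never be used for actual protection; in a cloud environment the shared key would come from a key management service rather than being generated in-process.

```python
import json
import secrets

# Shared data key; in practice fetched from a KMS, not generated locally.
SHARED_DATA_KEY = secrets.token_bytes(32)

def toy_encrypt(key, plaintext):
    """Toy XOR cipher: placeholder for real authenticated encryption."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(plaintext))

toy_decrypt = toy_encrypt  # XOR is its own inverse

def produce(payload):
    """Encrypt the sensitive payload before it enters the stream."""
    ciphertext = toy_encrypt(SHARED_DATA_KEY, json.dumps(payload).encode())
    return {"ciphertext": ciphertext.hex()}   # broker only ever sees this

def consume(message):
    """Any consumer holding the data key can recover the payload."""
    plaintext = toy_decrypt(SHARED_DATA_KEY,
                            bytes.fromhex(message["ciphertext"]))
    return json.loads(plaintext)

msg = produce({"ssn": "123-45-6789"})
print(consume(msg)["ssn"])  # prints 123-45-6789
```

Unlike the claim check, the producer holds no extra state here; the only shared dependency between producer and consumers is access to the key.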