


Streaming is Micro-Batching when feeding a Data Lake

Streaming data into Data Lakes is one of those ideas that doesn't work exactly the way you expect it to. Your big data streams are converted and sent into your data lake by an old-school batch process.
The primary reason for this is the way data is stored in the cloud: non-appendable file/blob storage. Existing data files cannot have data appended to them, so new files/objects must be continually generated to achieve any kind of producer performance.
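Because every flush must create a brand-new object rather than append to an existing one, ingestion naturally becomes micro-batching: buffer records, then write them out as a fresh, uniquely named object. A minimal sketch of that pattern, where `put_object` is a hypothetical stand-in for your blob store's upload call (e.g. an S3 PutObject or Azure Blob upload):

```python
import json
import time
import uuid

class MicroBatchWriter:
    """Buffers streamed records and flushes them as new objects.

    Object storage is not appendable, so each flush creates a brand-new
    key. `put_object` is a hypothetical callable(key, payload_bytes)
    representing your blob store's upload API.
    """

    def __init__(self, put_object, max_records=1000, max_age_seconds=60):
        self.put_object = put_object
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.buffer = []
        self.opened_at = time.monotonic()

    def append(self, record):
        self.buffer.append(record)
        too_big = len(self.buffer) >= self.max_records
        too_old = time.monotonic() - self.opened_at >= self.max_age_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if not self.buffer:
            return None
        # A unique key per flush: never append, always create.
        key = f"ingest/{int(time.time())}-{uuid.uuid4().hex}.json"
        payload = "\n".join(json.dumps(r) for r in self.buffer).encode()
        self.put_object(key, payload)
        self.buffer = []
        self.opened_at = time.monotonic()
        return key
```

The size/age thresholds are the usual micro-batching trade-off: smaller batches mean fresher data in the lake but many more small objects to manage downstream.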
There are other issues that may drive a project to fully old-school batch-style ingestion. This often happens due to the batch-oriented nature of data producers, the need to store data in exactly the same format it arrived in, or the need to store data in exactly the same order as it was presented to the lake.

Streaming Data
Streaming sends individual documents or records to interested parties. Those streaming listeners then write the data to their appropriate storage. Messages are often limited in size due to streaming and streaming-storage restrictions. Uniqueness …
