Organizing the Raw Zone - Data straight from the Producers

The right approach for laying down raw data in your data lake or your Cloud Warehouse depends on your goals.

  • Are you trying to ensure the data lands exactly as sent for traceability?
  • Are you planning on transforming the data to a consumer model to decouple producers and consumers?
  • Do you have structured, semi-structured, document, or binary data?
  • Do you have PII exposure?


Video


Presentation Slides

This section exists to provide static copies of the material in the video. Additional content may be added over time.

We're talking as if you have a data pipeline that moves data from the producers into locations that are friendly to data consumers. It could be a simple pipeline with just a couple of steps, or it could be something sophisticated that includes things like DataVault modeling layers. There are two main things to think about. Who owns making the data consumable? Are you capable of supporting an ongoing promotion process that converts data from producer schemas to consumer schemas?

Some organizations just drop the data in their raw zone for others to directly consume.  The rest of the slides assume that lake or warehouse data goes through one or more transformations to make it more usable by consumers.  

The next decision is to determine if any changes will be applied to incoming data before it is first landed. This decision can be more difficult than it sounds. It takes into account traceability, governance and capabilities.
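If you decide to land data exactly as sent, one way to keep traceability is to write the payload byte-for-byte and put the bookkeeping in a separate sidecar file. The following is a minimal sketch of that idea, not a description of any particular product. The `land_raw` function, the directory layout, and the sidecar field names are all assumptions made up for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw(payload: bytes, producer: str, dataset: str, root: Path) -> Path:
    """Write the payload unchanged and record traceability metadata beside it.

    Hypothetical layout: <root>/<producer>/<dataset>/<yyyy>/<mm>/<dd>/<sha256>.raw
    """
    digest = hashlib.sha256(payload).hexdigest()
    received = datetime.now(timezone.utc)

    target_dir = root / producer / dataset / received.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    # The data file holds exactly the bytes the producer sent.
    data_path = target_dir / f"{digest}.raw"
    data_path.write_bytes(payload)

    # Anything we add (checksum, arrival time) lives in a sidecar,
    # so the landed payload itself is never modified.
    sidecar = {
        "producer": producer,
        "dataset": dataset,
        "received_utc": received.isoformat(),
        "sha256": digest,
        "size_bytes": len(payload),
    }
    (target_dir / f"{digest}.meta.json").write_text(json.dumps(sidecar, indent=2))
    return data_path
```

Keeping the checksum outside the payload means a later audit can confirm that what sits in the raw zone is still what the producer sent. If you instead choose to mask PII or convert formats before landing, that guarantee no longer holds, which is part of why this decision is harder than it sounds.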

A raw zone or landing zone strategy must take into account the different shapes and types of data. I've worked at places where our landing strategy only worked for flat or rectangular data. It was all pretty straightforward, but it fell short of covering all our data. We had to come back later and design a binary document strategy. Look at your data to get a feel for the variation.

We just talked about the different types of data. Each of those data shapes has its own challenges with respect to size, shape, complexity, and data protection. You may end up with a couple of different strategies depending on the structure, size, and data type. You can design a good system that totally collapses under a few high-volume data sets that have occasional large records. Sometimes companies will pick a raw zone strategy that doesn't work with any of the standard tools for those data types. For example, there are tradeoffs when embedding pictures or images inside other data types.
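To make the "couple of different strategies" idea concrete, here is a rough sketch of routing incoming records by shape and size. Everything in it is an assumption for illustration: the strategy names, the `MAX_INLINE_BYTES` threshold, and the simple rules in `classify_shape` are hypothetical and would need to be replaced with whatever fits your own data.

```python
from typing import Any

# Hypothetical threshold: records larger than this take a separate path
# so occasional huge records don't overwhelm the main landing flow.
MAX_INLINE_BYTES = 1_000_000


def classify_shape(record: Any) -> str:
    """Very rough shape detection for a single incoming record."""
    if isinstance(record, (bytes, bytearray)):
        return "binary"
    if isinstance(record, dict):
        has_nesting = any(isinstance(v, (dict, list)) for v in record.values())
        return "nested" if has_nesting else "flat"
    return "unknown"


def landing_strategy(record: Any, size_bytes: int) -> str:
    """Pick a (hypothetical) landing strategy name from shape and size."""
    shape = classify_shape(record)
    if shape == "binary":
        # Store the binary on its own and keep only a pointer in other data,
        # rather than embedding images inside rows or documents.
        return "object_store_with_pointer"
    if size_bytes > MAX_INLINE_BYTES:
        return "oversize_record_path"
    if shape == "flat":
        return "rectangular_table"
    if shape == "nested":
        return "document_store"
    return "quarantine_for_review"
```

For example, `landing_strategy({"id": 1, "name": "a"}, 40)` falls through to the rectangular path, while the same call with a nested `"tags"` list would be routed as a document. The point is only that the routing decision is explicit, so a new data shape shows up as a gap instead of silently breaking the flat-data path.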

Watch the video for a discussion of Flat, Rectangular data in the raw zone.  I've also done several other blog articles about rectangular data in a data lake.

Watch the video for a discussion of Document or Nested structure data in the Raw Zone.

Watch the video for a discussion of managing Binary data in a Raw Zone.

Blog created 2021/12
