Isolating Historical Data and Breaking Changes

Teams often end up with a data set whose compatibility broke at some point in time. This commonly happens when historical data came from a previous system. We want to combine that data so that consumers have to understand as little of the difference as possible.

The differences between historical and active data amount to a major-version, breaking change to the data. Each of the two major versions can be isolated in its own raw storage area and then merged in one of our consumer-driven zones. We can continue to support minor-version producer schema changes as they occur in one of the raw streams. Those changes are then handled in the transformation tier that feeds the conformed zone.
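The isolate-then-merge flow can be sketched in a few lines of Python. This is only an illustration under assumed names: the `ConformedOrder` shape, the legacy and active field layouts, and the optional `currency` field are all hypothetical and do not come from any real system. Each raw zone gets its own transformation, and only the conformed shape is visible to consumers.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ConformedOrder:
    """Hypothetical single shape that consumers see in the conformed zone."""
    order_id: str
    customer_name: str
    amount_cents: int
    order_date: date
    currency: str


def from_historical(row: Dict) -> ConformedOrder:
    """Transformation for the raw historical zone (assumed legacy layout)."""
    return ConformedOrder(
        order_id=str(row["id"]),
        customer_name=f'{row["first"]} {row["last"]}',
        amount_cents=int(Decimal(row["total"]) * 100),  # dollars string -> cents
        order_date=datetime.strptime(row["date"], "%m/%d/%Y").date(),
        currency="USD",  # assumption: the legacy system only handled USD
    )


def from_active(row: Dict) -> ConformedOrder:
    """Transformation for the raw active zone (assumed current layout)."""
    return ConformedOrder(
        order_id=row["order_id"],
        customer_name=row["customer_name"],
        amount_cents=row["amount_cents"],
        order_date=date.fromisoformat(row["order_date"]),
        # Minor-version change: "currency" is assumed to be a later, optional
        # addition, so older active records fall back to a default here.
        currency=row.get("currency", "USD"),
    )


def build_conformed(historical: Iterable[Dict], active: Iterable[Dict]) -> List[ConformedOrder]:
    """Merge both major versions into the one consumer-facing data set."""
    return [from_historical(r) for r in historical] + [from_active(r) for r in active]
```

The two raw layouts never share a storage area in this sketch. The only place they meet is the output of `build_conformed`, and the minor-version change is absorbed inside `from_active` rather than leaking to consumers.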

We register and link the three data sets (the two raw data sets and the merged one) in our Data Governance Catalog. This lets us capture the data models while enforcing data change and compatibility rules. Disciplined organizations will also register the two transformations that feed into the shared data set.
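As a rough illustration of what registering, linking, and enforcing compatibility could look like, here is a toy in-memory catalog. It is not the API of any real governance product: `Catalog`, `CatalogEntry`, the data set names, and the rule that a minor change may only add fields are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Schema = Dict[str, str]  # field name -> type name


@dataclass
class CatalogEntry:
    name: str
    major_version: int
    schema: Schema
    upstream: List[str] = field(default_factory=list)  # lineage links


def is_minor_change(old: Schema, new: Schema) -> bool:
    """Assumed rule: every existing field keeps its type; additions are allowed."""
    return all(new.get(name) == type_ for name, type_ in old.items())


class Catalog:
    """Hypothetical stand-in for a Data Governance Catalog."""

    def __init__(self) -> None:
        self.entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        current = self.entries.get(entry.name)
        if (
            current is not None
            and current.major_version == entry.major_version
            and not is_minor_change(current.schema, entry.schema)
        ):
            # A breaking change must not land in an existing data set.
            raise ValueError(f"breaking change to {entry.name}; register a new data set")
        self.entries[entry.name] = entry


catalog = Catalog()
catalog.register(CatalogEntry("orders_raw_historical", 1, {"id": "int", "total": "string"}))
catalog.register(CatalogEntry("orders_raw_active", 2, {"order_id": "string", "amount_cents": "int"}))
catalog.register(CatalogEntry(
    "orders_conformed", 1,
    {"order_id": "string", "amount_cents": "int"},
    upstream=["orders_raw_historical", "orders_raw_active"],
))
```

With this rule, adding an optional field to `orders_raw_active` is accepted as a minor version, while dropping or retyping a field is rejected and has to become its own data set.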

I'm going to claim that we should just say no to combining the two incompatible data sets in the same raw or curated zone. We wouldn't do this with a relational database or with any streaming system that enforces schema versioning. We shouldn't do it here either.

Created 10/2021
