Schema drift - when historical and current should be different datasets
Data producers often create multiple versions of their data over time. Some of those changes are additive and easy to push from operational to analytical stores. Others are transformational, breaking changes. We want to integrate these changes in a way that reduces the amount of magic required of our consumers.
All these changes should be captured in data catalogs where consumers can discover our datasets, their versions, and the datasets they feed.
Managing incompatible Producer Schemas
We can take the following approach when providing consumers a unified consumer-driven data model; a routing sketch follows the list.
- Version all data sets and schema changes.
- Minor versions represent backward-compatible changes like table, column, or property additions.
- Major version numbers represent breaking changes from the previous versions.
- Data should be stored in separate raw zone data sets based on major version numbers.
- Some catalogs only support schema migration for non-breaking changes. In these cases, major versions may need to be registered as different data sets with cross-references.
- Consumer-driven models can be aggregates of several Producer Schema versions.
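As a concrete illustration, here is a minimal sketch of how a pipeline might route incoming records to raw zone data sets by major version. The data set name, version strings, and path layout are all hypothetical.

```python
def raw_dataset_for(dataset: str, schema_version: str) -> str:
    """Route records to a raw zone data set keyed by major version.

    Minor versions (backward-compatible additions) share a data set;
    major versions (breaking changes) get a data set of their own.
    """
    major = schema_version.split(".")[0]
    return f"raw/{dataset}_v{major}"

assert raw_dataset_for("orders", "1.0") == "raw/orders_v1"
assert raw_dataset_for("orders", "1.1") == "raw/orders_v1"  # minor change: same data set
assert raw_dataset_for("orders", "2.0") == "raw/orders_v2"  # breaking change: new data set
```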
Breaking changes evolution
Here we have a data source that exists across two major versions, each with two minor versions of its own. In this case, the breaking change was caused by a migration from one operational platform to another. We create raw, producer-driven data sets for each major version. Minor, compatible changes continue to live in the same data sets.
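The sketch below shows why minor, additive changes can coexist in one data set: readers simply tolerate the column that older records lack. The record shapes and the "coupon" field are hypothetical.

```python
v1_0_record = {"order_id": 17, "amount": 42.50}                     # schema 1.0
v1_1_record = {"order_id": 18, "amount": 9.99, "coupon": "SAVE10"}  # 1.1 adds "coupon"

def read_order(record: dict) -> dict:
    # Backward-compatible read: default the column added in 1.1.
    return {**record, "coupon": record.get("coupon")}

print(read_order(v1_0_record))  # {'order_id': 17, 'amount': 42.5, 'coupon': None}
print(read_order(v1_1_record))  # {'order_id': 18, 'amount': 9.99, 'coupon': 'SAVE10'}
```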
Here we have two possible options for the conformed zone.
The first option retains all versions of conformed data in a single data set. This means that the conformed data set breaks the consumers at some point in time. The timeline in the diagram implies that we convert all the original data to the 2.0 format. An alternative is to put both versions in the same data set and require consumers to know when breaking changes occur.
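A minimal sketch of the first option, assuming a hypothetical breaking change where 2.0 split a single "name" field in two: every record is coerced to the latest consumer shape before landing in the single conformed data set.

```python
def to_conformed_v2(record: dict, schema_version: str) -> dict:
    """Coerce any record to the 2.0 consumer shape."""
    if schema_version.startswith("1."):
        # Hypothetical breaking change: 2.0 split "name" into two fields.
        first, _, last = record["name"].partition(" ")
        return {"order_id": record["order_id"], "first_name": first, "last_name": last}
    return record  # already in the 2.0 shape

print(to_conformed_v2({"order_id": 17, "name": "Ada Lovelace"}, "1.1"))
# {'order_id': 17, 'first_name': 'Ada', 'last_name': 'Lovelace'}
```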
Click to Enlarge |
Option two leaves the original conformed zone as-is. The original zone stops receiving data on the 2.0 release date. We create a new data set for post-2.0 data and either have consumers read from both the original and the updated data sets, or we backfill the new data set with all of the old data converted to the new format.
I'm suggesting some version of option two.
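A sketch of the backfill variant of option two, using the same hypothetical 1.x-to-2.0 conversion as above: the original conformed data set is frozen at the release date, and the new data set is seeded with converted copies of the old records.

```python
def to_v2(record: dict) -> dict:
    # Hypothetical 1.x -> 2.0 conversion: split "name" into two fields.
    first, _, last = record["name"].partition(" ")
    return {"order_id": record["order_id"], "first_name": first, "last_name": last}

# The original conformed data set; no writes land here after the 2.0 release date.
conformed_v1_frozen = [{"order_id": 17, "name": "Ada Lovelace"}]

# Backfill the new data set; fresh 2.x records append to it from here on.
conformed_v2 = [to_v2(r) for r in conformed_v1_frozen]
print(conformed_v2)  # [{'order_id': 17, 'first_name': 'Ada', 'last_name': 'Lovelace'}]
```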
Versioning Lineage Complexity for Raw Zone Versions
We need to capture the source and destination data versions in our data catalog. Here we look at lineage when we create new datasets for breaking changes.
The first flow describes the lineage that results from creating new raw datasets for just the new schema whenever we have incompatible changes. The diagram shows that a single conformed zone contains the consumer model across all versions.
We tie the various source versions to each other. We tie the dependent raw zone versions back to the source versions they are derived from. In this case, each raw zone ties back to only a single major source version.
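Expressed as catalog edges, the first flow might look like the sketch below. The names are hypothetical, and a real catalog's lineage API would replace the plain dict.

```python
# Each key lists the data sets it is derived from.
lineage = {
    "source/orders@2": ["source/orders@1"],  # source versions tied to each other
    "raw/orders_v1": ["source/orders@1"],    # each raw zone ties back to a single major source version
    "raw/orders_v2": ["source/orders@2"],
    "conformed/orders": ["raw/orders_v1", "raw/orders_v2"],  # one consumer model across all versions
}
```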
The second flow describes the lineage that results when we create a new raw dataset for the new schema and backfill it with converted versions of the data from the previous schema. The diagram shows that a single conformed zone contains the consumer model across all versions.
We tie the various source versions to each other. We tie the dependent raw zone versions back to the source versions they are derived from. This means one of the raw zone datasets is derived from multiple source data sets.
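The second flow changes the edges in the same hypothetical sketch: because raw_v2 is backfilled with converted 1.x data, it derives from both source versions, and (under this sketch's assumption that it now holds all the history) the conformed zone can read from it alone.

```python
lineage_backfilled = {
    "source/orders@2": ["source/orders@1"],
    "raw/orders_v2": ["source/orders@1", "source/orders@2"],  # derived from multiple sources
    "conformed/orders": ["raw/orders_v2"],
}
```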
Sometimes you don't unify the consumer model
Large, incoercible changes may mean segregating the data all the way to the consumers. You may decide that you never want to merge the major versions, leaving the handling to the consumers. This should be avoided, and avoiding it is one of the reasons we create transformed, consumer-driven models on top of the various versions of raw data.
Related
- https://joe.blog.freemansoft.com/2021/12/methods-of-supporting-schema-drift-in.html
Created 2021/12