Schema drift - when historical and current should be different datasets

December 15, 2021

Data producers often create multiple versions of their data across time. Some of those changes are additive and easy to push from operational to analytical stores. Other changes are transformational breaking changes. We want to integrate these changes in a way that reduces the amount of magic required by our consumers.

All these changes should be captured in data catalogs where consumers can discover our datasets, their versions, and the datasets they feed.

Managing incompatible Producer Schemas

We can take this approach when providing consumers a unified consumer-driven data model.

Version all data sets and schema changes.

Minor versions represent backward-compatible changes like table, column, or property additions.
Major version numbers represent breaking changes from the previous versions.

Data should be stored in separate raw zone data sets based on major version numbers.
Some catalogs only support schema migration for non-breaking changes. In these cases, major versions may need to be registered as different data sets with cross-references.
Consumer-driven models can be aggregates of several Producer Schema versions.
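
The minor/major rule above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a schema is modeled as a mapping of column name to type, and the names SchemaVersion, is_backward_compatible, and next_version are hypothetical rather than part of any catalog API.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SchemaVersion:
    """Hypothetical major.minor version label for a producer schema."""
    major: int
    minor: int


def is_backward_compatible(old: Mapping[str, str], new: Mapping[str, str]) -> bool:
    """Additions only: every old column must still exist with the same type."""
    return all(new.get(name) == col_type for name, col_type in old.items())


def next_version(current: SchemaVersion,
                 old: Mapping[str, str],
                 new: Mapping[str, str]) -> SchemaVersion:
    """Bump minor for compatible changes, major for breaking ones."""
    if dict(old) == dict(new):
        return current  # no schema change, so no new version
    if is_backward_compatible(old, new):
        return SchemaVersion(current.major, current.minor + 1)
    return SchemaVersion(current.major + 1, 0)
```

With this sketch, adding an email column to a 1.0 schema yields 1.1, while renaming name to full_name yields 2.0. A real compatibility check would likely also need to consider nested properties and type widening.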

Breaking changes evolution

Here we have a data source that exists across two major versions, each with two minor versions. In this case, the breaking change was caused by a migration from one operational platform to another.

We create raw, producer-driven data sets for each major version. Minor, compatible changes continue to live in the same data sets.
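
A minimal sketch of that routing, assuming each incoming record carries a schema_version tag such as "1.1" (the tag, the data set naming pattern, and both function names are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def raw_dataset_for(source: str, schema_version: str) -> str:
    """Name the raw data set after the major version only, e.g. 'orders_raw_v1'."""
    major = schema_version.split(".")[0]
    return f"{source}_raw_v{major}"


def route_records(source: str, records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by target raw data set using their schema_version tag."""
    routed: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        routed[raw_dataset_for(source, record["schema_version"])].append(record)
    return dict(routed)
```

Records tagged 1.0 and 1.1 land in the same raw data set, while 2.0 and 2.1 land in a second one, which matches the two-major-version timeline described above.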

Here we have two possible options for the conformed zone.

The first option retains all versions of the conformed data in a single data set. This means that the conformed data set breaks the consumers at some point in time.

The timeline implies that we convert all the original data to the 2.0 format so that it fits in the same data set. An alternative is to put both versions in the same data set and require consumers to know when breaking changes occur.

The second option leaves the original conformed zone as-is. The original zone stops receiving data on the 2.0 release date. We create a new data set for post-2.0 data and either have consumers read from both the original and the new data sets, or we backfill all the old data into the new data set in the new format.
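
The backfill variant can be sketched as follows. The field names are invented for illustration: this assumes the 2.0 model renamed cust_id to customer_id and added a region field that older records never had.

```python
from typing import Iterable, List


def upgrade_v1_record(record: dict) -> dict:
    """Convert a 1.x conformed record to the 2.0 shape (hypothetical fields)."""
    return {
        "customer_id": record["cust_id"],  # assumed rename in 2.0
        "region": "unknown",               # assumed new 2.0 field, absent in 1.x
    }


def backfill(original: Iterable[dict], updated: Iterable[dict]) -> List[dict]:
    """Build the post-2.0 data set contents, including converted history."""
    return [upgrade_v1_record(r) for r in original] + list(updated)
```

The original conformed data set is only read here, never modified, so existing consumers of it keep working while new consumers see one consistent 2.0 shape.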

I'm suggesting some version of the second option.

Versioning Lineage Complexity for Raw Zone Versions

We need to capture the source and destination data versions in our data catalog. Here we look at lineage when we create new datasets for breaking changes.

The first lineage flow describes the lineage that results from creating new raw datasets for just the new schema whenever we have incompatible changes. The diagram shows that a single conformed zone contains the consumer model across all versions.

We tie the various source versions to each other. We tie the dependent raw zone versions back to the source versions they are derived from. In this case, each raw zone dataset ties back to only a single major source version.

The second lineage flow describes the lineage that results when we create a new raw dataset for the new schema and backfill it with converted versions of the data that predates the new schema. The diagram shows that a single conformed zone contains the consumer model across all versions.

We tie the various source versions to each other. We tie the dependent raw zone versions back to the source versions they are derived from. This means one of the raw zone datasets is derived from multiple source data sets.
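
The difference between the two lineage flows can be sketched as plain edge lists. The dataset names and the upstream_of helper are hypothetical, and only the source-to-raw edges are modeled. A real catalog would record these relationships through its own lineage API.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (upstream data set, downstream data set)

# Flow 1: each raw data set holds only its own major version.
SEPARATE_RAW: List[Edge] = [
    ("source_v1", "raw_v1"),
    ("source_v2", "raw_v2"),
]

# Flow 2: raw_v2 is also backfilled with converted source_v1 data.
BACKFILLED_RAW: List[Edge] = [
    ("source_v1", "raw_v1"),
    ("source_v1", "raw_v2"),
    ("source_v2", "raw_v2"),
]


def upstream_of(edges: List[Edge], dataset: str) -> List[str]:
    """List the data sets that feed the given data set directly."""
    return sorted(up for up, down in edges if down == dataset)
```

In the first flow raw_v2 has a single upstream source, while in the backfilled flow it has two, which is the extra lineage the catalog needs to capture.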

Sometimes you don't unify the consumer model

Large incoercible changes may mean segregating the data all the way to the consumers. You may decide that you never want to merge the major versions, leaving the handling to the consumers. This should be avoided and is one of the reasons we create transformed consumer-driven models on top of the various versions of raw data.

Video

https://joe.blog.freemansoft.com/2021/12/methods-of-supporting-schema-drift-in.html

Created 2021/12

Blog de Joe Freeman