Explorations: supporting schema drift in consumer schemas

December 15, 2021

We can merge incompatible schemas in a variety of ways to make a usable consumer-driven schema. A previous blog article described how we should treat and track breaking schema changes. Here we look at a couple of ways of merging different producer dataset versions into a single consumer dataset.

A new Conformed dataset with both versions

Example: We have a date field that changes from non-zoned to one that carries a timezone, from implicitly zoned to UTC, or from one timezone to another, such as UTC.

The source system has its own schema. Initially, it sends the data tied to a timezone without any zone info. That producer model is then pushed into a conformed schema. For the purposes of this discussion, we will assume that it just got pushed without any conversion.

Eventually, the source system decides to ship the data with timezone info or in a different timezone. This is actually a breaking change in the producer data model, so we put that raw, unchanged data into a new raw dataset.

We then decide that the conformed (consumer) zone should be in UTC or contain the timezone information. That is a breaking change in conformed. We may not want to break all the consumers, or we may not be able to recast the data type. In this case, we create a new conformed dataset. We backfill that dataset with the original version's data, transformed to align the timezone, and then add the new data in the new format. This removes any doubt as to what is in each conformed schema and doesn't silently break consumers by changing the meaning of a field.
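The backfill-then-append step can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the field name `event_time`, the function names, and the choice of a fixed UTC-5 offset for the original data are all assumptions made for the example. A real source tied to a named zone with daylight saving would need `zoneinfo.ZoneInfo` instead of a fixed offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the v1 source implicitly used a fixed UTC-5 offset.
ASSUMED_V1_OFFSET = timezone(timedelta(hours=-5))


def conform_v1(record: dict) -> dict:
    """Backfill path: attach the assumed zone to a naive timestamp, then convert to UTC."""
    naive = datetime.fromisoformat(record["event_time"])
    utc = naive.replace(tzinfo=ASSUMED_V1_OFFSET).astimezone(timezone.utc)
    return {**record, "event_time": utc.isoformat()}


def conform_v2(record: dict) -> dict:
    """New-data path: the timestamp already carries an offset, so just normalize to UTC."""
    aware = datetime.fromisoformat(record["event_time"])
    if aware.tzinfo is None:
        raise ValueError("v2 records must carry timezone information")
    return {**record, "event_time": aware.astimezone(timezone.utc).isoformat()}


# Hypothetical raw datasets, one per producer schema version.
raw_v1 = [{"id": 1, "event_time": "2021-12-15T09:00:00"}]
raw_v2 = [{"id": 2, "event_time": "2021-12-16T10:30:00+01:00"}]

# The new conformed dataset: backfilled v1 rows followed by v2 rows.
conformed_new = [conform_v1(r) for r in raw_v1] + [conform_v2(r) for r in raw_v2]
```

The original conformed dataset is left untouched, so existing consumers keep reading the values they always have, while new consumers read UTC everywhere.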


Versioning the records in Conformed

Example: We change the cardinality of a field. This is a breaking change coming from the source system.

We create new datasets whenever we have breaking schema changes. In this case, we also capture the schema version in each of the raw datasets. That version number may have come with the producer data, or it may be added as part of ingestion. You can see two raw zones: one with single cardinality (v1) and one with multiple cardinality (v2).
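One way to picture the ingestion step is the sketch below. It is only an illustration under assumptions: the field names `phone` and `phones`, the `schema_version` key, and the rule for inferring a missing version from the record's shape are all hypothetical.

```python
def ingest(record: dict, raw_zones: dict) -> dict:
    """Stamp a record with its major schema version and route it to the matching raw dataset."""
    version = record.get("schema_version")
    if version is None:
        # Assumption: the producer sent no version, so infer it from the record shape.
        version = 2 if "phones" in record else 1
    stamped = {**record, "schema_version": version}
    raw_zones.setdefault(f"raw_v{version}", []).append(stamped)
    return stamped


raw_zones: dict = {}
ingest({"id": 1, "phone": "555-0100"}, raw_zones)
ingest({"id": 2, "phones": ["555-0101", "555-0102"]}, raw_zones)
```

Apart from the added version number, the payload is stored as received, so each raw dataset holds exactly one producer schema version.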

We modify the conformed schema to support both cardinalities and to include the major schema version number. The data catalog tells consumers which columns are valid for each schema version number. Legacy consumers do not break against older data. They must be updated when they want to receive the newer data.
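A rough sketch of how a consumer might use the version column together with the catalog is shown here. The catalog structure, the column names `phone` and `phones`, and the sample rows are assumptions for illustration only, not the layout of any particular data catalog product.

```python
# Hypothetical catalog entry: conformed columns that are valid per major schema version.
VALID_COLUMNS = {
    1: {"id", "schema_version", "phone", "phones"},
    2: {"id", "schema_version", "phones"},
}

# Hypothetical conformed rows that carry both cardinalities plus the version.
conformed = [
    {"id": 1, "schema_version": 1, "phone": "555-0100", "phones": ["555-0100"]},
    {"id": 2, "schema_version": 2, "phone": None, "phones": ["555-0101", "555-0102"]},
]


def valid_view(record: dict) -> dict:
    """Return only the columns the catalog marks as valid for this record's version."""
    allowed = VALID_COLUMNS[record["schema_version"]]
    return {k: v for k, v in record.items() if k in allowed}


def legacy_consumer(records: list) -> list:
    """A v1-only consumer: reads the single-valued column and skips newer rows."""
    return [r["phone"] for r in records if r["schema_version"] == 1]
```

The legacy consumer keeps working against v1 rows and simply never sees v2 rows until it is updated to read the multi-valued column.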


We went from single cardinality to multiple cardinality in this example. In this case, we promoted the single cardinality item into a list containing one element and put it in the multiple cardinality field. This lets us carry data into a breaking change without loss of fidelity.
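The promotion described above might look something like this. The column names `phone` and `phones` and the function names are hypothetical, and treating a missing v1 value as an empty list is an assumption of this sketch.

```python
def conform_v1(raw: dict) -> dict:
    """Promote the single-valued v1 field into a one-element list, keeping the original too."""
    single = raw.get("phone")
    return {
        "id": raw["id"],
        "schema_version": 1,
        "phone": single,
        # Assumption: a missing v1 value becomes an empty list rather than [None].
        "phones": [single] if single is not None else [],
    }


def conform_v2(raw: dict) -> dict:
    """v2 data is already multi-valued; the single-valued column is not valid for it."""
    return {
        "id": raw["id"],
        "schema_version": 2,
        "phone": None,
        "phones": list(raw["phones"]),
    }
```

Because every row now has the list-valued field populated, an updated consumer can read `phones` across both versions without branching on the version number.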

Video

https://joe.blog.freemansoft.com/2021/12/schema-drift-when-historical-and.html

Created 2021/12

Blog de Joe Freeman