Driving Schema Ownership and Organization

The internet age and cloud computing have moved data size, ownership, and privacy concerns to an entirely new level. This is driving organizations to rethink where data is moved, how it is transformed, who can see it, and who owns it.

Let's talk about how data moves from producer systems through data pipelines and into consumable locations with the correct access controls.

Target State

We want to create a well-organized data ecosystem that makes information available to data consumers, organized for their use cases and delivered in a way that works with their tools.
  1. Data starts in source systems owned by applications and organized for the data producer's use cases.
  2. The data is extracted from the producer systems in raw form and stored in a data ingestion zone or data lake. This stage represents the end of the producer's involvement.
  3. The data is converted from producer-oriented models to consumption-independent models. Traditionally this has been a standard normalized form, but many big data teams use other models like Data Vault. This phase is essentially owned by the data teams. The model here may not be friendly to producers or consumers.
  4. Data is then combined and transformed into schemas that are organized for the consumers. There is usually a de-normalization phase when creating relational-database-style data sets. A minimal sketch of these stages follows this list.
  5. Note that we don't show any end-user schemas like a custom report schema. End-user schemas create lineage and audit risks. They may be needed, but in my opinion they should be kept to a minimum, or there should be a mechanism for managing end-user schemas as corporate assets.
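
To make these stages concrete, here is a minimal sketch of the flow as plain Python functions. The table, column, and field names (an `orders` feed) are hypothetical, and a real pipeline would run each stage as a separate job in an orchestrator; this only shows where each model style lives.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """A single row moving through the pipeline (illustrative only)."""
    payload: dict

def extract_raw(producer_rows: list[dict]) -> list[Record]:
    """Stage 2: land producer data unchanged in the ingestion zone.
    The producer's involvement ends here."""
    return [Record(payload=dict(row)) for row in producer_rows]

def conform(raw: list[Record]) -> list[Record]:
    """Stage 3: map producer-oriented fields to a consumption-independent
    model (here, just standardized column names)."""
    rename = {"ord_id": "order_id", "cust": "customer_id", "amt": "amount"}
    return [
        Record(payload={rename.get(k, k): v for k, v in r.payload.items()})
        for r in raw
    ]

def build_consumer_schema(conformed: list[Record]) -> list[dict]:
    """Stage 4: reshape / denormalize into a consumer-friendly data set."""
    return [r.payload for r in conformed]

# Hypothetical producer rows, named for the producer's use case.
producer_rows = [{"ord_id": 1, "cust": "C-9", "amt": 42.50}]
consumable = build_consumer_schema(conform(extract_raw(producer_rows)))
print(consumable)  # [{'order_id': 1, 'customer_id': 'C-9', 'amount': 42.5}]
```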
The Data Engineering lane represents data organized in some corporate- or department-standard format. These are often:
  1. Mechanically de-normalized
  2. An industry-standard model type like Data Vault 2.0
  3. A corporate data model standard
This approach can require extra transformations, even for trivial data sets that could be directly consumed.
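
As an illustration of the Data Vault style mentioned above, the sketch below splits one source row into a hub record (the business key) and a satellite record (its descriptive attributes). The `orders_app` source tag and column names are invented, and a real Data Vault 2.0 implementation adds links, hash diffs, and more; this is only meant to show why the intermediate model is friendly to neither producers nor consumers.

```python
from datetime import datetime, timezone
from hashlib import sha256

def hash_key(business_key: str) -> str:
    """Deterministic surrogate key from the business key."""
    return sha256(business_key.encode()).hexdigest()

def to_hub_and_satellite(row: dict, business_key_col: str, source: str):
    """Split one source row into hub + satellite records.
    Column names and the source-system tag are hypothetical."""
    now = datetime.now(timezone.utc).isoformat()
    hk = hash_key(str(row[business_key_col]))
    hub = {"hub_key": hk, "business_key": row[business_key_col],
           "load_ts": now, "record_source": source}
    sat = {"hub_key": hk, "load_ts": now, "record_source": source,
           **{k: v for k, v in row.items() if k != business_key_col}}
    return hub, sat

hub, sat = to_hub_and_satellite(
    {"order_id": 1, "customer_id": "C-9", "amount": 42.5},
    business_key_col="order_id", source="orders_app")
print(hub)
print(sat)
```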

Current State

The organizational current state is often a mishmash of styles. Companies may take different approaches in different departments. I've worked at places that have both Current State 1 and Current State 2 described here.

Common Current State 1

In this common current state, consuming teams are often forced to use the producer-provided schemas in the raw data sets directly. It has the advantage of requiring as few transformations as possible.

Consumers directly consume the producer-oriented model. This creates a hard coupling that makes it difficult to change the producer-provided data.
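
One common way to soften that coupling is a thin mapping layer that owns the producer's field names, so consumers see a stable schema and a producer rename breaks one mapping instead of every consumer. A hedged sketch with hypothetical field names:

```python
# Hypothetical producer payload, organized for the producer's use case.
producer_row = {"ord_id": 1, "cust": "C-9", "amt": 42.5}

# Hard coupling: every consumer references producer field names directly,
# so a producer rename ("amt" -> "amount_usd") breaks them all.
total = producer_row["amt"]

# Decoupled: one mapping layer owns the producer names; consumers see a
# stable, consumer-oriented schema.
PRODUCER_TO_CANONICAL = {"ord_id": "order_id", "cust": "customer_id",
                         "amt": "amount"}

def to_canonical(row: dict) -> dict:
    """Translate producer field names to the canonical consumer schema."""
    return {PRODUCER_TO_CANONICAL.get(k, k): v for k, v in row.items()}

canonical = to_canonical(producer_row)
total = canonical["amount"]  # survives producer-side renames
```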

Common Current State 2

Some organizations have no true standard intermediate format: no canonical data model and no notion of a canonical or conformed zone. Consumers in those organizations often consume producer models directly or create their own data models. Consumer-built models have compliance, lineage, and data-correctness issues. Their transforms are often part of shadow IT, and their end-user models reside in places where they cannot be audited or scanned.
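
One lightweight mitigation, sketched below with hypothetical names, is to require every consumer-built data set to carry lineage metadata (sources, transform, owner) so it can at least be scanned and audited, even when it lives outside the central platform.

```python
# Hypothetical lineage envelope for a derived data set. Requiring this
# metadata on every consumer-built model makes shadow transforms
# scannable even when they live outside the central platform.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DerivedDataset:
    name: str
    source_tables: list[str]   # where the data came from
    transform: str             # pointer to the transform code
    owner: str                 # accountable team or person
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

report = DerivedDataset(
    name="emea_orders_report",
    source_tables=["conformed.orders", "conformed.customers"],
    transform="git://reports/emea_orders.py",  # hypothetical repo path
    owner="finance-analytics",
)
print(report)
```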

Imperatives

Contemporary data regulations and requirements mean organizations need to apply more discipline to how they create and consume data. Privacy and other rules mean data needs to be used and stored in ways that auditors can see and verify.

  1. Create a standard data movement pipeline with well-understood stages (see the sketch after this list).
  2. Decide which raw model you want to use: simple de-normalized, a corporate-standard canonical data model, Data Vault 2.0, or some other standard.
  3. Create standard processes for building consumer models from the raw / conformed data models.
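
As a sketch of the first imperative, the pipeline's stages and owners can be made explicit and machine-readable, which is also what auditors ultimately need. The zone names, owners, and model choices below are hypothetical placeholders:

```python
# Hypothetical, machine-readable description of the standard pipeline.
# Making stages and owners explicit is what lets auditors answer
# "where is this data and who owns it" without reading job code.
PIPELINE_STAGES = [
    {"zone": "raw",       "owner": "producer-team",  "model": "source schema"},
    {"zone": "conformed", "owner": "data-eng",       "model": "Data Vault 2.0"},
    {"zone": "consumer",  "owner": "analytics-team", "model": "denormalized marts"},
]

def stage_for(zone: str) -> dict:
    """Look up a stage so jobs can be validated against the standard."""
    matches = [s for s in PIPELINE_STAGES if s["zone"] == zone]
    if not matches:
        raise ValueError(f"unknown zone {zone!r}; not part of the standard pipeline")
    return matches[0]

print(stage_for("conformed")["owner"])  # data-eng
```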
