Shaping Big Data - Schema on Read or Schema on Write

Data Lakes often have the some of the same performance and security decisions as past year's data warehouses.  

Teams need to decide if the data in a lake is stored in producer formats or consumer formats or a combination of the two. 

Storage is essentially unlimited which means we may choose to store the data in multiple consumer oriented fashion.

Compute is essentially unlimited. We may decide to apply view style restrictions and access controls at read time. 
          

Video

Speaker's Notes

This discussion is really only about tabular style data stored in cloud blob/object stores.  See data lakes for squares. Record oriented data can be built up from files/objects that are organized in folders/partitions that are then aggregated into virtual tables.  

Data layout in the lake dramatically affects performance. Data objects are read in their entirety.  Joins are essentially external operators.  Views are not a native concept.  We may chose to organize the data based on the consumption patterns.

One option is to use the essentially unlimited compute to pull in data some format and filter the data as it streams to the client.  The client schema built as the data is read.

Schema on Read is often built by some type of proxy that sits between native cloud storage and the consumer.  See https://joe.blog.freemansoft.com/2020/05/fine-grained-controls-schema-on-read.html as an example of hos this could be implemented

Schema on Write means that the data is stored in the lake in format or formats that are targeted to the consumers or oriented towards simplifying attribute style filtering. Schema on write trades off space and potential data reorganizations against simplified security administration and the ability to use cloud native access controls.

The simplest version of Schema on Write is to create copies of the data for each consumer type.  A custom table is written to the cloud for each view pattern with a special emphasis on sensitive columns.  Consumers may end up with more columns than they need but never any they shouldn't see.  The example here has 3 tables, the original and two formatted to hide sensitive information from various consumers.

A slightly finer grained, more heavy handed approach is to break the data into groups. Consumers are given access to the appropriate column groups.  Essentially each of the column groups is its own table. The consumer reassembles the column groups to create a full table within the limits of their permissions. 

A totally cloud only approach is to break each data column into its own file/table.  We can take this approach because we are not constrained by normal DB stores or compute resources. This approach makes it possible to set permissions on individual columns using standard cloud access controls. Some data products built on cloud storage take this approach


Created 5/2020

Comments