Posts

Showing posts with the label Data Lake

Organizing the Raw Zone - Data straight from the Producers

The right approach for laying down raw data in your data lake or your Cloud Warehouse depends on your goals. Are you trying to ensure the data lands exactly as sent for traceability? Are you planning on transforming the data to a consumer model to decouple producers and consumers? Do you have structured data, semi-structured data, documents, or binaries? Do you have PII exposure? Video Presentation Slides This section exists to provide static copies of the material in the video. Additional content may be added over time. We're talking as if you have a data pipeline that moves data from the producers into locations that are friendly to data consumers. It could be a simple pipeline with just a couple of steps or it could be something sophisticated that includes things like DataVault modeling layers. There are two main things to think about. Who owns making the data consumable? Are you capable of supporting an ongoing promotion process that converts data from producer schemas to c...
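
As a rough sketch of one way to keep raw data traceable (the path layout, names, and partitioning here are assumptions for illustration, not a prescribed convention), the landing location can encode the producer, the dataset, the schema major version, and the load date:

    from datetime import date

    def raw_zone_path(producer: str, dataset: str, major_version: int,
                      load_date: date, base: str = "raw") -> str:
        """Build a raw-zone key: producer / dataset / major schema version / load date."""
        return (f"{base}/{producer}/{dataset}/v{major_version}/"
                f"load_date={load_date.isoformat()}/")

    # Example landing prefix: raw/crm/customers/v2/load_date=2024-05-01/
    print(raw_zone_path("crm", "customers", 2, date(2024, 5, 1)))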

Explorations: supporting schema drift in consumer schemas

We can merge incompatible schemas in a variety of ways to make a usable consumer-driven schema. A previous blog article described how we should treat and track breaking schema changes. We're going to look at a couple of ways of merging different producer dataset versions into a single consumer dataset. A new Conformed dataset with both versions Example: We have a date field that changes from non-zoned to one that has a timezone, or from implicitly zoned to UTC, or from one timezone to another timezone like UTC. The source system has its own schema. Initially, it sends the data tied to a timezone without any zone info. That producer model is then pushed into a conformed schema. For the purposes of this discussion, we will assume that it just got pushed without any conversion. Eventually, the source system decides to ship the data with timezone info...
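
A minimal sketch of that merge, assuming pandas and an assumed implicit source timezone (America/Chicago here is purely illustrative): the v1 rows are localized and converted to UTC so they can sit next to v2 rows that already carry zone info.

    import pandas as pd

    # v1 rows: naive timestamps, implicitly in the producer's local zone (an assumption).
    v1 = pd.DataFrame({"order_id": [1, 2],
                       "created_at": ["2023-01-05 09:30:00", "2023-01-05 17:45:00"]})
    # v2 rows: timestamps already carry zone information.
    v2 = pd.DataFrame({"order_id": [3],
                       "created_at": ["2023-06-01T12:00:00+00:00"]})

    ASSUMED_SOURCE_TZ = "America/Chicago"  # assumption about the zone v1 data was implicitly in

    v1["created_at"] = (pd.to_datetime(v1["created_at"])
                          .dt.tz_localize(ASSUMED_SOURCE_TZ)
                          .dt.tz_convert("UTC"))
    v2["created_at"] = pd.to_datetime(v2["created_at"], utc=True)

    # Conformed dataset: one UTC-zoned column regardless of which producer version it came from.
    conformed = pd.concat([v1, v2], ignore_index=True)
    print(conformed)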

Schema drift - when historical and current should be different datasets

Data producers often create multiple versions of their data across time. Some of those changes are additive and easy to push from operational to analytical stores. Other changes are transformational breaking changes. We want to integrate these changes in a way that reduces the amount of magic required by our consumers. All these changes should be captured in data catalogs where consumers can discover our datasets, their versions, and the datasets they feed. Managing incompatible Producer Schemas We can take this approach when providing consumers with a unified consumer-driven data model. Version all data sets and schema changes. Minor versions represent backward-compatible changes like table, column, or property additions. Major version numbers represent breaking changes from the previous version. Data should be stored in separate raw zone data sets ba...
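
As a small illustration of that versioning rule (the naming convention is an assumption, not something prescribed by the post), a router can keep minor versions in the same raw dataset and send each major version to its own:

    def raw_dataset_for(dataset: str, schema_version: str) -> str:
        """Route a producer feed to a raw dataset keyed by its major version.
        Minor versions (additive changes) share the dataset; major versions get a new one."""
        major, _minor = schema_version.split(".", 1)
        return f"{dataset}_v{major}"

    assert raw_dataset_for("orders", "1.4") == "orders_v1"   # additive change, same dataset
    assert raw_dataset_for("orders", "2.0") == "orders_v2"   # breaking change, new dataset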

Isolating Historical Data and Breaking Changes

Teams often run into situations where they have a data set that broke its compatibility at some point in time. This often happens when you have historical data that came from a previous system. We want to combine that data in a way that requires consumers to understand as little of that difference as possible. The differences between historical and active data are essentially a major-version, breaking change to the data. The two major versions of the data can be isolated in their own raw storage areas and then merged together in one of our consumer-driven zones. We can continue to support minor-version producer schema changes as they occur in one of the raw streams. Those changes would then be handled in the transformation tier into the conformed zone. We register and link the three data sets in our Data Governance Catalog. This lets us capture the data models while enforcing data change and compatibility rules. Disciplined organiz...
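
A minimal sketch of that merge in the transformation tier, assuming pandas and made-up column names: each major version is mapped to the conformed model and then unioned, so consumers see one shape instead of two.

    import pandas as pd

    # Historical data (v1) from the previous system, with different column names (assumed).
    historical = pd.DataFrame({"cust_no": [101], "full_name": ["Ada Lovelace"]})
    # Active data (v2) from the current producer.
    active = pd.DataFrame({"customer_id": [102], "name": ["Grace Hopper"]})

    # Transformation tier: map each major version to the conformed model, then union.
    conformed = pd.concat(
        [
            historical.rename(columns={"cust_no": "customer_id", "full_name": "name"}),
            active,
        ],
        ignore_index=True,
    )
    print(conformed)  # one conformed shape for consumers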

Loading both Lake and Warehouse - Single Transform Path

Data organization, build-vs-buy, transform audit, and technology choices all depend on your organization's policies, business, and compliance requirements. We are going to look at some business requirements that might put us on a different path from the parallel-load, warehouse-first, and lake-first patterns previously discussed. Video Discussion This pattern assumes that all the primary raw and conformed/curated transformations happen in one data repository with one set of tools. The raw and conformed/curated zones are then replicated into the other repository. Your org would choose whether the lake or the warehouse is home for transformations for those zones.
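
A rough sketch of that single transform path, with placeholder functions standing in for real loaders and replication jobs: transform once in the home repository, then copy both zones, unchanged, into the other store.

    def land_raw(target: str) -> None:
        print(f"landing raw data in {target}")

    def transform_raw_to_conformed(target: str) -> None:
        print(f"transforming raw -> conformed in {target}")

    def replicate(zone: str, source: str, destination: str) -> None:
        print(f"replicating {zone} zone from {source} to {destination}")

    def run_single_transform_path(home: str = "lake", other: str = "warehouse") -> None:
        # All primary transformations happen in the home repository with one toolset.
        land_raw(target=home)
        transform_raw_to_conformed(target=home)
        # The finished zones are then replicated, without re-transforming, into the other repository.
        for zone in ("raw", "conformed"):
            replicate(zone=zone, source=home, destination=other)

    run_single_transform_path(home="lake", other="warehouse")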

Loading both Cloud Data Lake and Warehouse

Let's map out how data can flow from the originating Operational Store to both the Data Lake and Data Warehouse. We have to decide if the Data Warehouse and Data Lake are peers or if one is the gold source and the other is a copy. Internet, web, and connected applications have created a data explosion. Cheap storage and unlimited computing power are empowering new use cases like ML and revolutionizing old ones like CRM and CDP. Hive and Hadoop ushered in the age of big data. Data used to exist in two locations: operational and reporting databases. Now data of all types can be collected into a single multi-petabyte Data Lake without expensive custom hardware. Business requirements and regulatory needs should drive your design. The top diagram shows the originating systems loading the Data Lake and the Data Warehouse in parallel as p...
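
A minimal sketch of the parallel, peer-load pattern, with placeholder loaders (the function bodies are stand-ins, not real connectors): the same extract from the operational store is pushed to both targets.

    from concurrent.futures import ThreadPoolExecutor

    def load_lake(batch: list) -> str:
        return f"lake loaded {len(batch)} rows"       # placeholder for the real lake loader

    def load_warehouse(batch: list) -> str:
        return f"warehouse loaded {len(batch)} rows"  # placeholder for the real warehouse loader

    batch = [{"id": 1}, {"id": 2}]
    with ThreadPoolExecutor() as pool:
        # Peers: both targets receive the same extract directly from the operational store.
        futures = [pool.submit(load_lake, batch), pool.submit(load_warehouse, batch)]
        for f in futures:
            print(f.result())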

Call Recordings and other Binary Data and Metadata in the Data Lake

Data lakes hold data in any format. This includes structured data, semi-structured text data, documents, and binary data. Organizing that binary data and its metadata can be done in several ways. Video Images in Video Welcome We're talking about binary data and its associated descriptive metadata. This shows some of the metadata that could be associated with each call recording. The recording itself is highly sensitive because we don't know exactly what was said. The extracted text is also highly sensitive because it is a full-text copy with the same risk. Media and binary files can add up. We could have millions of call records and all of their associated metadata. It is a large data problem. We have to pick the format for the binary, non-rectangular data and its associated metadata. We can use the native formats and links or embed the binary data inside another format. ...
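
One hedged sketch of the "native format plus link" option, with assumed field names and paths: keep the recording in its original format and write a small metadata sidecar that points at it.

    import json
    from pathlib import Path

    def write_call_metadata(recording_path: str, call_id: str, agent_id: str,
                            duration_seconds: int, out_dir: str = ".") -> Path:
        """Write a JSON sidecar describing one call recording and linking to the binary file."""
        metadata = {
            "call_id": call_id,
            "agent_id": agent_id,
            "duration_seconds": duration_seconds,
            "recording_uri": recording_path,   # link to the native-format audio, not an embedded copy
            "sensitivity": "high",             # recording and transcript are treated as highly sensitive
        }
        sidecar = Path(out_dir) / f"{call_id}.metadata.json"
        sidecar.write_text(json.dumps(metadata, indent=2))
        return sidecar

    print(write_call_metadata("s3://calls/2024/05/01/abc123.wav", "abc123", "agent-42", 315))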

Cloud Data Lake vs Warehouse - fit for purpose

Data Lakes and Data Warehouses each have their own strengths and weaknesses. You may need one or the other depending on your needs. Look at your use cases to determine whether it makes sense to have one, the other, or both. Maybe this can give you more things to think about when deciding on one over the other. My general experience has been that Data Lakes tend to be the choice when feeding operational systems and when storing binary data. They are often used for massive data transformations or ML feature creation. Sometimes security concerns and partitions may drive highly sensitive data to protected lakes. Data Warehouses tend to be the choice when humans need big data for reporting, data exploration, and collaborative environments. Use cases that put them in the middle of data flows for operational systems should be evaluated for uptime and latency. Different companies will prioritize differently. I've seen companies that were lake only,...

Tokenizing Sensitive Information - PII Protection

The only way to protect sensitive information is to remove the sensitive values everywhere they are not absolutely needed. Data designers can remove the fields completely or change the field values so that they are useless in the case of data theft. Data tokenization and data encryption are two possible solutions to this issue. Both approaches must be implemented so that they return the same non-PII value for a given PII value every time they are invoked. We're going to talk about tokenization here. Tokenized field values must be changed in a repeatable way so that the attributes are still useful for joining data in queries or reports. This means every data set with the same value for the same PII field will have the same replaced value. This lets us retain the ability to join across datasets or tables using sensitive data fields. Every PII field has a typecode or a key. That type is used whenever...
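
A minimal sketch of deterministic tokenization, assuming an HMAC keyed per PII typecode (the key handling and token format here are illustrative, not the post's implementation): the same PII value always maps to the same token, so joins on tokenized columns still line up.

    import hmac, hashlib

    # One secret per PII typecode (an assumption); in practice these would live in a secrets manager.
    FIELD_KEYS = {"SSN": b"ssn-demo-key", "EMAIL": b"email-demo-key"}

    def tokenize(value: str, pii_type: str) -> str:
        """Deterministically replace a PII value: same value + same typecode -> same token."""
        digest = hmac.new(FIELD_KEYS[pii_type], value.encode("utf-8"), hashlib.sha256)
        return f"{pii_type}:{digest.hexdigest()[:16]}"

    # Both calls produce identical tokens, so datasets tokenized separately can still be joined.
    print(tokenize("jane@example.com", "EMAIL"))
    print(tokenize("jane@example.com", "EMAIL"))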

The Future is Zero PII in Lakes and Analytical Stores

The only way to protect PII is to remove it from your Lake or other Analytical Stores. New regulations and laws create stiff penalties for data leaks and give consumers or customers the right to know all the places their data is used. We want to remove PII to meet new regulations while still retaining enough information to join across datasets. Recorded Talk Speaker Notes Speaker Notes not yet available.