Posts

Showing posts with the label Data Warehouse

Organizing the Raw Zone - Data straight from the Producers

The right approach for laying down raw data in your data lake or your Cloud Warehouse depends on your goals. Are you trying to ensure the data lands exactly as sent for traceability? Are you planning on transforming the data to a consumer model to decouple producers and consumers? Do you have structured data, semi-structured data, documents, or binaries? Do you have PII exposure? Video Presentation Slides This section exists to provide static copies of the material in the video. Additional content may be added over time. We're talking as if you have a data pipeline that moves data from the producers into locations that are friendly to data consumers. It could be a simple pipeline with just a couple of steps or it could be something sophisticated that includes things like DataVault modeling layers. There are two main things to think about. Who owns making the data consumable? Are you capable of supporting an ongoing promotion process that converts data from producer schemas to c...

Explorations: supporting schema drift in consumer schemas

We can merge incompatible schemas in a variety of ways to make a usable consumer-driven schema. A previous blog article described how we should treat and track breaking schema changes. We are going to look at a couple of ways of merging different producer dataset versions into a single consumer dataset. A new Conformed dataset with both versions Example: We have a date field where the date changes from non-zoned to one that has a timezone. Or it changes from implicitly zoned to UTC. Or the date changes from one timezone to another timezone, like UTC. The source system has its own schema. Initially, it sends the data tied to a timezone without any zone info. That producer model is then pushed into a conformed schema. For the purposes of this discussion, we will assume that it just got pushed without any conversion. Eventually, the source system decides to ship the data with Timezone info...
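As a minimal sketch of the date example above, the snippet below folds unzoned and zoned producer records into one conformed shape. The field names (`event_time`, `event_time_utc`, `source_version`), the function `to_conformed`, and the fixed UTC-5 offset assumed for unzoned data are all hypothetical choices for illustration, not taken from the original post.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the first producer version sent local times in a fixed UTC-5 zone.
ASSUMED_SOURCE_TZ = timezone(timedelta(hours=-5))


def to_conformed(record: dict) -> dict:
    """Map either producer version to one conformed record.

    The original value is kept next to the normalized one so consumers
    can still trace what the producer actually sent.
    """
    raw = datetime.fromisoformat(record["event_time"])
    if raw.tzinfo is None:
        # Older producer version: no zone info, so apply the assumed zone.
        zoned = raw.replace(tzinfo=ASSUMED_SOURCE_TZ)
        source_version = "v1-unzoned"
    else:
        # Newer producer version: zone info travels with the value.
        zoned = raw
        source_version = "v2-zoned"
    return {
        "event_time_original": record["event_time"],
        "event_time_utc": zoned.astimezone(timezone.utc).isoformat(),
        "source_version": source_version,
    }
```

Keeping `source_version` on every row is one way to let consumers filter or audit by producer version without knowing when the change happened.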

Schema drift - when historical and current should be different datasets

Data producers often create multiple versions of their data across time. Some of those changes are additive and easy to push from operational to analytical stores. Other changes are transformational breaking changes. We want to integrate these changes in a way that reduces the amount of magic required by our consumers. All these changes should be captured in data catalogs where consumers can discover our datasets, their versions, and the datasets they feed. Managing incompatible Producer Schemas We can take this approach when providing consumers a unified consumer-driven data model. Version all data sets and schema changes. Minor versions represent backward-compatible changes like table, column, or property additions. Major version numbers represent breaking changes from the previous versions. Data should be stored in separate raw zone data sets ba...
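The minor/major versioning rule above can be sketched roughly as follows. This assumes a schema is represented as a simple `{column_name: type_name}` mapping and a version as a `"major.minor"` string; `classify_change` and `bump_version` are hypothetical helper names, and a real catalog would likely track more than column names and types.

```python
from typing import Dict


def classify_change(old: Dict[str, str], new: Dict[str, str]) -> str:
    """Return 'none', 'minor', or 'major' for a schema change.

    Assumption: removing or retyping a column is breaking (major);
    only adding columns is backward-compatible (minor).
    """
    removed = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    if removed or retyped:
        return "major"
    if new.keys() - old.keys():
        return "minor"
    return "none"


def bump_version(version: str, change: str) -> str:
    """Bump a 'major.minor' version string according to the change type."""
    major, minor = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0"
    if change == "minor":
        return f"{major}.{minor + 1}"
    return version
```

For example, adding a `name` column to `{"id": "int"}` classifies as `"minor"`, while changing `id` from `int` to `str` classifies as `"major"`.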

Loading both Cloud Data Lake and Warehouse

Let's map out how data can flow from the originating Operational Store to both the Data Lake and Data Warehouse. We have to decide if the Data Warehouse and Data Lake are peers or if one is the gold source and the other is a copy. Internet, web, and connected applications have created a data explosion. Cheap storage and unlimited computing power are empowering new use cases like ML and revolutionizing old ones like CRM and CDP. Hive and Hadoop ushered in the age of big data. Data used to exist in two locations, operational and reporting databases. Now all data of all types can be collected into a single multi-petabyte Data Lake without expensive custom hardware. Business requirements and regulatory needs should drive your design. The top diagram shows the originating systems loading the Data Lake and the Data Warehouse in parallel as p...

Call Recordings and other Binary Data and Metadata in the Data Lake

Data lakes hold data in any format. This includes structured data, semi-structured text data, documents, and binary data. Organizing that binary data and its metadata can be done in several ways. Video Images in Video Welcome We're talking about binary data and its associated descriptive metadata. This shows some of the metadata that could be associated with each call recording. The recording itself is highly sensitive because we don't know exactly what was said. The extracted text is also highly sensitive because it is a full text copy with the same risk. Media / binary files can add up. We could have millions of call records and all of their associated metadata. It is a large data problem. We have to pick the format for the binary, non-rectangular data and its associated metadata. We can use the native formats and links or embed the binary data inside another format. ...