
Showing posts with the label Lake

Driving Schema Ownership and Organization

The internet age and cloud computing have moved data size, ownership, and privacy to an entirely new level. This is driving organizations to re-think where data is moved, how it is transformed, who can see it, and who owns it. Let's talk about how data moves from producer systems through data pipelines and into consumable locations with the correct access controls. Target State We want to create a well-organized data ecosystem that makes information available to data consumers. The resulting data should be organized for consumers and made available in a way that works with consumer tools. Data starts in source systems owned by applications and organized for the data producer's use cases. The data is extracted from the producer systems in raw form and stored in a data ingestion zone or data lake. This stage represents the end of the producer's involvement. The data is then converted from producer-oriented models to consumption-independent models. Traditionally this has been a stan...
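A minimal sketch of that producer-to-consumer flow, assuming local lake paths, pandas, and illustrative column names; none of these specifics come from the post itself.

```python
# Sketch: producer extract -> raw ingestion zone -> consumption-independent model.
# Paths, column names, and the rename-based transform are assumptions for illustration.
import pandas as pd

RAW_EXTRACT = "lake/raw/orders/2024-05-01/orders.csv"            # producer-shaped extract
CONSUMPTION_ZONE = "lake/conformed/orders/2024-05-01/orders.parquet"

def ingest_raw(producer_extract: str) -> pd.DataFrame:
    """End of the producer's involvement: land the extract unchanged."""
    return pd.read_csv(producer_extract)

def to_consumption_model(raw: pd.DataFrame) -> pd.DataFrame:
    """Convert producer-oriented names to a consumption-independent model."""
    return raw.rename(columns={"ord_id": "order_id", "cust": "customer_id"})

raw = ingest_raw(RAW_EXTRACT)
to_consumption_model(raw).to_parquet(CONSUMPTION_ZONE)
```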

Streaming Can Make Consumption Complicated

Streaming data into a lake is a powerful, modern approach that replaces traditional ETL. There are use cases where streaming can make things difficult for business systems or direct data users. Your data ingestion tier may have to support both data streaming and bulk processing. Speaker's Notes: Notes to be added.

Lake Mutability in the Age of Privacy and Regulation

Data immutability was one of the core tenets of data lakes when they first became big. Mutable data went to relational and document databases while immutable data and documents were stored in the lake. Emerging privacy and data sharing regulations are adding data retention, data visibility and data management rules and behaviors that may drive companies to re-think which data should be stored, and how, in data lakes. Video blog Phase 1: Data Set Storage Retention. Retention times are set on the file(s) that make up a dataset. Datasets are managed as files. Entire datasets are removed at the end of the retention period. Phase 2a: Partition Storage Retention. Retention times are stored somewhere and bound to partition keys. Data is organized as tables in a table/partition/file format. P...
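One way the "Phase 2a" idea might look in practice: retention bound to a date-valued partition key in a table/partition/file layout. The directory convention and retention window below are assumptions, not a prescribed design.

```python
# Sketch: expire whole partitions (not whole datasets) once the partition key
# falls outside the retention window. Layout: trades/partition_date=YYYY-MM-DD/*.parquet
import shutil
from datetime import date, timedelta
from pathlib import Path

RETENTION_DAYS = 365
TABLE_ROOT = Path("lake/standardized/trades")

def expire_partitions(table_root: Path, retention_days: int) -> None:
    cutoff = date.today() - timedelta(days=retention_days)
    for partition in table_root.glob("partition_date=*"):
        key = date.fromisoformat(partition.name.split("=", 1)[1])
        if key < cutoff:
            shutil.rmtree(partition)   # the partition expires; the rest of the table survives

expire_partitions(TABLE_ROOT, RETENTION_DAYS)
```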

Schema on Write - Consumer Driven Schemas

What does it mean to move from a relational database style Schema on Read to Schema on Write? Schema on Write is used to stage data in a consumer-friendly form. It can also be used in poor-join-performance environments to restructure and stage data in consumer-read format. It is pretty much mandatory for document databases. Ingestion stores data in its original format for compliance, audit or other purposes. This copy may be called True Source. Format Standardization converts the raw information into an agreed-upon standard format. Examples include data tables in a lake or documents in a document store. This is purely a mechanical conversion. Consumption Models are built from raw data and reference data, applying view and business rules to create a consumer-ready dataset...
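A hedged sketch of those three steps, true-source ingestion, format standardization, and a consumption model. File names, columns, the reference join, and the "active only" business rule are all illustrative assumptions.

```python
import shutil
import pandas as pd

# 1. Ingestion: keep the producer file byte-for-byte as "True Source".
shutil.copy("inbound/accounts.csv", "lake/true_source/accounts.csv")

# 2. Format standardization: a purely mechanical conversion to the agreed format.
raw = pd.read_csv("lake/true_source/accounts.csv")
raw.to_parquet("lake/standardized/accounts.parquet")

# 3. Consumption model: join reference data and apply view/business rules.
accounts = pd.read_parquet("lake/standardized/accounts.parquet")
regions = pd.read_csv("reference/regions.csv")
consumer_ready = accounts.merge(regions, on="region_code", how="left")
consumer_ready = consumer_ready[consumer_ready["status"] == "active"]   # business rule
consumer_ready.to_parquet("lake/consumption/accounts_by_region.parquet")
```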

Shaping Big Data - Schema on Read or Schema on Write

Data lakes often face some of the same performance and security decisions as past years' data warehouses. Teams need to decide whether the data in a lake is stored in producer formats, consumer formats, or a combination of the two. Storage is essentially unlimited, which means we may choose to store the data in multiple consumer-oriented formats. Compute is essentially unlimited, so we may decide to apply view-style restrictions and access controls at read time. Video Speaker's Notes This discussion is really only about tabular-style data stored in cloud blob/object stores. See data lakes for squares. Record-oriented data can be built up from fil...
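A small sketch of the schema-on-read side of this trade-off: the producer file stays untyped on disk, and the schema plus a view-style column restriction are imposed only when the data is read. The column names and types are assumptions.

```python
import pandas as pd

READ_SCHEMA = {"trade_id": "int64", "notional": "float64", "trader": "string"}
CONSUMER_COLUMNS = ["trade_id", "notional"]        # view-style restriction applied at read time

def read_with_schema(path: str) -> pd.DataFrame:
    raw = pd.read_csv(path, dtype=str)             # producer format, untyped on disk
    return raw.astype(READ_SCHEMA)[CONSUMER_COLUMNS]

trades = read_with_schema("lake/raw/trades/2024-05-01.csv")
```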

Fine Grained Controls - Schema on Read with Cloud Data Lakes

Data lakes are for all types of data, but sometimes we treat parts of our data lake as data warehouses. Fine-grained access controls can be used to provide view-like functionality where we can filter out columns or rows based on access rules. Fine-grained access controls are implemented on top of the cloud object/blob store. They are only really implemented two and a half ways. Recording Row and column access controls applied to cloud blob data <Speaker's notes to be added>
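To make the view-like behavior concrete, here is a deliberately simplified sketch of row and column filtering applied on top of objects in the lake; the entitlement model, role names, and columns are assumptions rather than any particular product's API.

```python
import pandas as pd

ENTITLEMENTS = {
    "analyst": {"columns": ["account_id", "balance"],
                "row_filter": lambda df: df[df["region"] == "US"]},
    "auditor": {"columns": ["account_id", "region", "balance", "ssn"],
                "row_filter": lambda df: df},
}

def read_table(path: str, role: str) -> pd.DataFrame:
    rules = ENTITLEMENTS[role]
    df = pd.read_parquet(path)        # raw object(s) in the cloud blob/object store
    df = rules["row_filter"](df)      # row-level control
    return df[rules["columns"]]       # column-level control

us_balances = read_table("lake/consumption/accounts.parquet", "analyst")
```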

Cloud Lake Storage - Files vs Tables

The cost and scale of cloud storage make it possible to store large amounts of data in almost any format. We can build abstractions on top of this simple, cheap, highly available storage that let us access it as if it were something more sophisticated. Let's divide the data into two coarse types, unstructured and structured. Unstructured data, in this case, is any data whose content is un-typed, is not easily machine-parseable, or does not follow a tight schema. Sound, media, pictures and PDF documents are examples of unstructured data for this discussion. Structured data has a deterministic content format and is often designed for machine consumption. Let us break structured data into columnar-record and non-columnar-record formats. Most columnar record formats are designed for many records per data object. Non-columnar formats may be single record per file or may not be easily flattenable into record format. We will...
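A short sketch of that coarse split: an unstructured object stored as opaque bytes next to structured, columnar records stored many-per-object. The paths and record fields are illustrative assumptions.

```python
import shutil
import pandas as pd

# Unstructured: the PDF is just bytes; the lake does not interpret its content.
shutil.copy("inbound/contract_1042.pdf", "lake/unstructured/contracts/contract_1042.pdf")

# Structured, columnar-record: typed records, many per data object.
records = pd.DataFrame(
    [{"contract_id": 1042, "counterparty": "ACME", "notional": 1_000_000.0}]
)
records.to_parquet("lake/structured/contract_summary/part-0001.parquet")
```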

Data Lake - getting data into the zone

Data lakes exist to store and expose data in its native format without size or format constraints. Cloud data storage makes it possible to store large amounts of data without worrying about costs or data loss. Corporate lakes often store the same data multiple times in transformed or enriched formats, making it easier to use. My last two employers each had over 20 petabytes of data in their lakes. A well-managed lake organizes data based on usage, data quality, data trust levels, governance policies, data sensitivity and information lifecycles. Lake architects can spread their data across horizontal zones for purpose and/or vertical organization zones. The actual purpose zones vary by industry or company. Zone Based Data Organization This diagram demonstrates a zone structure that might fit a financial services company. It assumes that the company generates its own data and receives data from external organizations. Data exists in unstru...
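One possible way to make a zone structure concrete is a path convention that encodes the purpose zone, owning domain, dataset, and load date. The specific zone names and ordering below are assumptions, not a standard.

```python
from datetime import date

ZONES = {"raw", "standardized", "curated", "sandbox"}   # horizontal zones for purpose

def lake_path(zone: str, domain: str, dataset: str, load_date: date) -> str:
    """Build a lake location following the assumed zone/domain/dataset convention."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"lake/{zone}/{domain}/{dataset}/load_date={load_date.isoformat()}/"

print(lake_path("standardized", "payments", "wire_transfers", date(2024, 5, 1)))
# lake/standardized/payments/wire_transfers/load_date=2024-05-01/
```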

Data Lakes are not just for squares

Columnar-only lakes are just another warehouse. Data lakes are intended to be a source of truth for their varied data. Some organizations restrict their lake to columnar data, violating one of the main precepts behind data lakes. They limit the data lake to large data set transformations or automated analytics. This limiting definition leaves those companies without anywhere to store a significant subset of their total data pool. Data lakes are not restricted. Data lakes hold data in its original format to retain data fidelity. All data sets retain their original structure, data types and raw data format. Some enterprise data lakes make the data more usable by storing the same data in multiple formats: the original format and a more queryable, accessible format. This approach exactly preserves the original data while making it more accessible. Examples of multiple-copy same-data storage include: CSV and other data that is also stored in dire...
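A sketch of the multiple-copy idea for non-columnar data: the original object is preserved exactly, and a more queryable companion record is written beside it. The paths, checksum field, and metadata columns are assumptions for illustration.

```python
import hashlib
import shutil
from pathlib import Path
import pandas as pd

original = Path("inbound/statement_2024_04.pdf")

# 1. Preserve the original, untouched, for fidelity.
raw_copy = Path("lake/raw/statements") / original.name
raw_copy.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(original, raw_copy)

# 2. Write a queryable companion that describes (or is extracted from) the original.
companion = pd.DataFrame([{
    "object_path": str(raw_copy),
    "sha256": hashlib.sha256(original.read_bytes()).hexdigest(),
    "statement_month": "2024-04",
}])
companion.to_parquet("lake/curated/statement_index/part-0001.parquet")
```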