Data Lake - getting data into the zone

Data lakes exist to store and expose data in its native format without size or format constraints. Cloud storage makes it possible to keep large amounts of data without worrying much about cost or data loss. Corporate lakes often store the same data multiple times, in transformed or enriched formats that make it easier to use. My last two employers each had over 20 petabytes of data in their lakes.

A well-managed lake organizes data based on usage, data quality, data trust levels, governance policies, data sensitivity and information lifecycles. Lake architects can spread their data across horizontal zones organized by purpose, vertical zones organized by organization, or both. The actual purpose zones vary by industry and company.
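As a minimal sketch of how the two dimensions might combine, the snippet below builds a storage prefix from a vertical organization zone and a horizontal purpose zone. The `Zone` names, the `zone_prefix` helper and the path layout are all assumptions made for illustration, not a layout any particular platform requires.

```python
from enum import Enum


class Zone(Enum):
    """Horizontal purpose zones (illustrative names only)."""
    DMZ = "dmz"
    RAW = "structured-raw"
    CURATED = "structured-curated"


def zone_prefix(org_unit: str, zone: Zone, dataset: str) -> str:
    """Build a prefix: vertical organization zone, then purpose zone, then dataset."""
    org = org_unit.strip().lower()
    name = dataset.strip().lower()
    if not org or not name:
        raise ValueError("org_unit and dataset must be non-empty")
    return "/".join([org, zone.value, name]) + "/"


print(zone_prefix("Lending", Zone.RAW, "loan_applications"))
# prints: lending/structured-raw/loan_applications/
```

Putting the organization first keeps each unit's data under a single root. Putting the purpose zone first would instead make it easier to apply one policy to a whole zone. Either order can work.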

Zone Based Data Organization


This diagram demonstrates a zone structure that might fit a financial services company. It assumes that the company generates its own data and receives data from external organizations. Data exists in unstructured, semi-structured and structured formats.

Data validation is shown as an optional step. It is probably a requirement for certain zones.

A lake is only as useful as its data catalog and associated metadata. Data should be tracked and registered whenever it lands in one of the interior zones.
  • DMZ: This is the landing zone for externally sourced data. Data is moved to the internal zones when it is considered safe for the internal network.
  • Quarantine: This is the zone for data that has problems. Data stops here if it fails data type, data format, data sensitivity or other checks.
  • Vault: This zone exists for immutable or secure documents. A company might put ID card images or contracts in this zone. Some companies provide vault storage as a service so that customers can securely hold their documents.
  • Unstructured: This zone exists to hold data that is not easily machine consumable. Examples include call recordings and other audio and video files.
  • Structured Raw: This contains structured data in its original format irrespective of its utility or ease of use. This is the home zone for any data that requires lineage tracking.
  • Structured Curated: Data in this zone has been created for a specific set of consumers. Machine learning feature calculation results go into this zone. Copies of data in the Raw zone may be reformatted, transformed and stored in this zone. Materialized views and denormalized data go here.
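The routing described by these zones can be sketched in a few lines. This is only an illustration under stated assumptions: the `Dataset` fields, the order of the checks, the in-memory `catalog` list and the choice to treat every zone except the DMZ as an interior zone are all hypothetical. Structured Curated is included as an interior zone for catalog purposes, but nothing is routed there because curated data is produced by later transformation rather than by landing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

# Assumption: every zone except the DMZ counts as an interior zone.
INTERIOR_ZONES = {"Quarantine", "Vault", "Unstructured",
                  "Structured Raw", "Structured Curated"}


@dataclass
class Dataset:
    name: str
    source: str                        # "external" or "internal"
    structured: bool
    immutable: bool = False            # e.g. contracts, ID card images
    cleared_for_internal: bool = False
    failed_checks: List[str] = field(default_factory=list)


def choose_zone(ds: Dataset) -> str:
    """Pick a landing zone; the order of these checks is an assumption."""
    if ds.source == "external" and not ds.cleared_for_internal:
        return "DMZ"
    if ds.failed_checks:               # type, format, sensitivity, ...
        return "Quarantine"
    if ds.immutable:
        return "Vault"
    if not ds.structured:
        return "Unstructured"
    return "Structured Raw"


def land(ds: Dataset, catalog: List[Dict[str, str]]) -> str:
    """Route a dataset and register it in the catalog if it lands in an interior zone."""
    zone = choose_zone(ds)
    if zone in INTERIOR_ZONES:
        catalog.append({
            "dataset": ds.name,
            "zone": zone,
            "registered_at": datetime.now(timezone.utc).isoformat(),
        })
    return zone


catalog: List[Dict[str, str]] = []
print(land(Dataset("vendor_feed", "external", True), catalog))      # DMZ
print(land(Dataset("call_recordings", "internal", False), catalog)) # Unstructured
print(len(catalog))                                                 # 1
```

A real lake would write to a catalog service rather than a list, and would probably record schema, owner and lineage along with the zone.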

Data Analytics

Business analysts and data scientists need their own production-like sandboxes where they can explore and transform production data. Machine learning training generates result data that must be stored somewhere for further analysis.

Transient Data Analytics zones may be a good solution. Data Analytics zones often purge information automatically after some period of time or when other conditions are met.
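An age-based purge can be sketched as below. The 30-day retention period, the `expired_objects` helper and the dictionary of object names to creation times are assumptions made for illustration. Cloud object stores usually offer lifecycle rules that do this without custom code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expired_objects(objects: Dict[str, datetime],
                    now: datetime,
                    max_age: timedelta = timedelta(days=30)) -> List[str]:
    """Return the names of objects older than max_age, sorted for stable output."""
    return sorted(name for name, created in objects.items()
                  if now - created > max_age)


now = datetime(2019, 11, 25, tzinfo=timezone.utc)
sandbox = {
    "experiments/run-001.parquet": datetime(2019, 9, 1, tzinfo=timezone.utc),
    "experiments/run-002.parquet": datetime(2019, 11, 20, tzinfo=timezone.utc),
}
print(expired_objects(sandbox, now))
# prints: ['experiments/run-001.parquet']
```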

Other Zone Models

Sites have different views on the right zones based on their use cases and experience.

| Source | Suggested Zones |
| --- | --- |
| SQL Chick, Zones in a Data Lake (2017) | Raw, Transient, Reference, Drop, Staged, Standardized Raw, Archive, Sandbox, Curated |
| Blue Granite, Modern Data Architecture eBook | |
| O'Reilly books: The Enterprise Data Lake | Raw, Production, Dev, Sensitive |
| DZone Big Data Zone | Transient, Raw, Trusted, Refined |
| NEOS Blog | Raw, Structured, Curated, Customer, Analytics |
| Health Catalyst, proposed healthcare zones | Raw, Trusted, Refined, Sandbox |
| Zaloni blog | Transient, Raw, Trusted, Refined |
| ScienceSoft blog | Landing, Staging, Analytics, Curated |

Access Controls

Zones may also be created or split to simplify coarse-grained access controls.
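A coarse-grained, zone-level check might look like the sketch below. The role names and the `ZONE_READERS` mapping are hypothetical. In practice these rules would normally live in the storage platform's own access policies rather than in application code.

```python
from typing import Dict, Set

# Hypothetical mapping of zone -> roles allowed to read anything in that zone.
ZONE_READERS: Dict[str, Set[str]] = {
    "Vault": {"compliance"},
    "Structured Raw": {"data-engineering"},
    "Structured Curated": {"data-engineering", "analysts", "data-science"},
}


def can_read(role: str, zone: str) -> bool:
    """Allow a read only if the zone explicitly lists the role (deny by default)."""
    return role in ZONE_READERS.get(zone, set())


print(can_read("analysts", "Structured Curated"))  # True
print(can_read("analysts", "Vault"))               # False
```

Because the decision depends only on the zone, splitting a zone (for example, separating sensitive curated data from the rest) is one way to tighten access without adding per-dataset rules.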

History

Last modified 2019/11/25
Created 2019/11/11
