Data Lake - getting data into the zone
Data lakes exist to store and expose data in its native format without size or format constraints. Cloud data storage makes it possible to store large amounts of data without worrying about costs or data loss. Corporate lakes often store the same data multiple times, in transformed or enriched formats that make it easier to use. My last two employers each had over 20 petabytes of data in their lakes.
A well-managed lake organizes data based on usage, data quality, data trust levels, governance policies, data sensitivity and information lifecycles. Lake architects can spread their data across horizontal zones organized by purpose, vertical zones organized by business unit, or both. The actual purpose zones vary by industry and company.
Zone Based Data Organization
This diagram demonstrates a zone structure that might fit a financial services company. It assumes that the company generates its own data and receives data from external organizations. Data exists in unstructured, semi-structured and structured formats.
Data validation is shown as an optional step. It is probably a requirement for certain zones.
A lake is only as useful as its data catalog and associated metadata. Data should be tracked and registered whenever it lands in one of the interior zones.
- DMZ: This is the landing zone for externally sourced data. Data is moved to the internal zones when it is considered safe for the internal network.
- Quarantine: This is the zone for data that has problems. Data stops here if it fails data type, data format, data sensitivity or other checks.
- Vault: This zone exists for immutable or secure documents. A company might put ID card images or contracts in this zone. Some companies provide vault storage as a service so that customers can securely hold their documents.
- Unstructured: This zone exists to hold data that is not easily machine consumable. Examples include call recordings and other audio and video files.
- Structured Raw: This contains structured data in its original format irrespective of its utility or ease of use. This is the home zone for any data that requires lineage tracking.
- Structured Curated: Data in this zone has been created for a specific set of consumers. Machine learning feature calculation results go into this zone. Copies of data in the Raw zone may be reformatted, transformed and stored in this zone. Materialized views and denormalized data go here.
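The zone descriptions above can be read as a routing decision. Here is a minimal Python sketch of that decision, assuming each dataset carries a few flags set by upstream checks. The `Dataset` fields, the `promote` function and the in-memory `catalog` are hypothetical names for illustration, not part of any particular lake product. The Curated zone is left out because data there is derived from Raw data rather than landed directly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Zone(Enum):
    DMZ = "dmz"
    QUARANTINE = "quarantine"
    VAULT = "vault"
    UNSTRUCTURED = "unstructured"
    STRUCTURED_RAW = "structured_raw"


@dataclass
class Dataset:
    # All fields are assumptions made for this sketch.
    name: str
    external: bool               # received from an outside organization
    structured: bool             # easily machine consumable
    immutable: bool = False      # e.g. contracts or ID card images
    passed_checks: bool = True   # type, format and sensitivity checks


# Stand-in for a real data catalog: dataset name -> zone it was registered in.
catalog: Dict[str, Zone] = {}


def landing_zone(ds: Dataset) -> Zone:
    """External data lands in the DMZ first; internal data skips it."""
    return Zone.DMZ if ds.external else promote(ds)


def promote(ds: Dataset) -> Zone:
    """Pick a zone inside the DMZ boundary and register the dataset in the catalog."""
    if not ds.passed_checks:
        zone = Zone.QUARANTINE       # data with problems stops here
    elif ds.immutable:
        zone = Zone.VAULT
    elif not ds.structured:
        zone = Zone.UNSTRUCTURED
    else:
        zone = Zone.STRUCTURED_RAW   # original format, lineage starts here
    catalog[ds.name] = zone
    return zone


contract = Dataset("contract.pdf", external=True, structured=False, immutable=True)
first_stop = landing_zone(contract)   # Zone.DMZ, not yet in the catalog
final_stop = promote(contract)        # Zone.VAULT, now registered
```

Registering quarantined data in the catalog is a choice made in this sketch. Some teams may prefer to track quarantined data separately.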
Data Analytics
Business analysts and data scientists need their own production-like sandboxes where they can explore and transform production data. Machine learning training generates results data that must be stored somewhere for further analysis.
Transient Data Analytics zones may be a good solution. Data Analytics zones often automatically purge information after some period of time or on other conditions.
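A time-based purge rule for a transient zone might look like the sketch below. It assumes the zone keeps a creation timestamp for each dataset. The `expired_datasets` function and the 14-day retention period are hypothetical. A real purge would delete from the underlying storage rather than return names.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def expired_datasets(
    created: Dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of datasets older than max_age, sorted by name."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, ts in created.items() if now - ts > max_age)


# Example: two sandbox datasets, checked against a 14-day retention period.
sandbox = {
    "scratch_a": datetime(2019, 11, 1, tzinfo=timezone.utc),
    "scratch_b": datetime(2019, 11, 20, tzinfo=timezone.utc),
}
as_of = datetime(2019, 11, 25, tzinfo=timezone.utc)
to_purge = expired_datasets(sandbox, timedelta(days=14), now=as_of)
```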
Other Zone Models
Sites have different views on the right zones based on their use cases and experience.
Source | Suggested Zones |
---|---|
SQL Chick Zones in a Data Lake 2017 | Raw, Transient, Reference, Drop, Staged, Standardized Raw, Archive, Sandbox, Curated |
Blue Granite Modern Data Architecture eBook | |
O'Reilly books: The Enterprise Data Lake | Raw, Production, Dev, Sensitive |
DZone Big Data Zone | Transient, Raw, Trusted, Refined |
NEOS Blog | Raw, Structured, Curated, Customer, Analytics |
Health Catalyst proposed healthcare zones | Raw, Trusted, Refined, Sandbox |
Zaloni blog | Transient, Raw, Trusted, Refined |
ScienceSoft blog | Landing, Staging, Analytics, Curated |
Access Controls
Zones may also be created or split to simplify coarse-grained access controls.
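With coarse-grained controls, a permission is granted on a whole zone rather than on individual datasets. A minimal sketch, assuming a simple role-to-zone mapping follows. The role names and the `ZONE_READERS` table are hypothetical and only illustrate the idea.

```python
from typing import Dict, FrozenSet

# Hypothetical mapping: zone name -> roles allowed to read anything in that zone.
ZONE_READERS: Dict[str, FrozenSet[str]] = {
    "vault": frozenset({"compliance"}),
    "structured_raw": frozenset({"data_engineering"}),
    "structured_curated": frozenset({"data_engineering", "analyst", "data_science"}),
    "analytics_sandbox": frozenset({"analyst", "data_science"}),
}


def can_read(role: str, zone: str) -> bool:
    """A role can read a zone only if it is listed for it; unknown zones deny all."""
    return role in ZONE_READERS.get(zone, frozenset())


analyst_sees_curated = can_read("analyst", "structured_curated")
analyst_sees_vault = can_read("analyst", "vault")
```

Splitting a zone, for example separating sensitive curated data from general curated data, then only means adding one more row to the mapping.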
History
Last modified 2019/11/25
Created 2019/11/11