Loading both Cloud Data Lake and Warehouse

Let's map out how data can flow from the originating Operational Store to both the Data Lake and the Data Warehouse. We have to decide whether the Data Warehouse and Data Lake are peers or whether one is the gold source and the other is a copy.

Web and connected applications have created a data explosion. Cheap storage and elastic computing power are enabling new use cases like ML and revolutionizing old ones like CRM and CDP. Hive and Hadoop ushered in the age of big data. Data used to exist in two locations: operational databases and reporting databases. Now data of all types can be collected into a single multi-petabyte Data Lake without expensive custom hardware.

Business requirements and regulatory needs should drive your design.


  1. The top diagram shows the originating systems loading the Data Lake and the Data Warehouse in parallel as peers. 
  2. The second diagram shows the Data Lake as the location for all data with some of that data replicated into the Warehouse.
  3. The third diagram is the inverse, showing the Warehouse as the primary data repository with the Data Lake as a copy/subset.
There are hybrid versions where the primary/copy paradigm varies based on the data type. A columnar data warehouse might be the primary for relational data while the lake might be the primary for binary and unstructured data.
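
A hybrid arrangement can be thought of as a small policy table keyed by data type. The sketch below is illustrative only: the `Store` enum, the type names, and the default of treating the lake as primary for unknown types are all assumptions, not part of any particular product.

```python
from enum import Enum


class Store(Enum):
    LAKE = "lake"
    WAREHOUSE = "warehouse"


# Hypothetical policy table: which store holds the primary copy per data type.
PRIMARY_BY_TYPE = {
    "relational": Store.WAREHOUSE,
    "binary": Store.LAKE,
    "unstructured": Store.LAKE,
}


def primary_store(data_type: str) -> Store:
    """Return the primary store for a data type, defaulting to the lake (assumption)."""
    return PRIMARY_BY_TYPE.get(data_type, Store.LAKE)


def copy_store(data_type: str) -> Store:
    """Return the store that receives the copy, i.e. the non-primary one."""
    primary = primary_store(data_type)
    return Store.LAKE if primary is Store.WAREHOUSE else Store.WAREHOUSE
```

Keeping the policy in one table makes it easy to see, per data type, which store is the gold source and which is the copy.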


Parallel Load


Source systems provide data in a stream or via drop location.  Lake and Warehouse-specific sinks pick up the data and store it on independent paths. The saved data is tracked in an external catalog and reconciled by an external process.
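
The parallel paths and the external reconciliation can be sketched roughly as follows. Everything here is a stand-in: the `Catalog` class, the two sink functions, the in-memory dicts playing the role of the stores, and the `id` and `amount` fields are hypothetical names chosen for illustration.

```python
from typing import Dict, Set


class Catalog:
    """Hypothetical external catalog: remembers which record ids each store saved."""

    def __init__(self) -> None:
        self.saved: Dict[str, Set[str]] = {"lake": set(), "warehouse": set()}

    def record(self, store: str, record_id: str) -> None:
        self.saved[store].add(record_id)


def lake_sink(event: dict, lake: dict, catalog: Catalog) -> None:
    """Store the event unchanged, as close to the source format as possible."""
    lake[event["id"]] = event
    catalog.record("lake", event["id"])


def warehouse_sink(event: dict, warehouse: dict, catalog: Catalog) -> None:
    """Store only the flat columns the (assumed) warehouse table expects."""
    warehouse[event["id"]] = {"id": event["id"], "amount": event["amount"]}
    catalog.record("warehouse", event["id"])


def reconcile(catalog: Catalog) -> Dict[str, Set[str]]:
    """Report ids that one store has saved and the other has not."""
    lake_ids = catalog.saved["lake"]
    warehouse_ids = catalog.saved["warehouse"]
    return {
        "missing_from_lake": warehouse_ids - lake_ids,
        "missing_from_warehouse": lake_ids - warehouse_ids,
    }
```

Because each sink runs independently, `reconcile` is the only place where the two stores are compared. If one sink lags or drops an event, the gap shows up there rather than in either load path.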

Advantages

  1. Data appears in each target store without having to route through an additional layer.
  2. The data stored in the raw zone is as close to the source system format as possible.  It is not mutated by transiting through an extra storage location.
  3. Binary and Unstructured data does not need special processing to land in the lake.
  4. Binary and Unstructured data does not have to transit through two systems.

Disadvantages

  1. Every transformation must be implemented twice, once for each channel.
  2. The data available in each repository overlaps like a Venn diagram rather than matching exactly, because the process is only eventually consistent.
  3. Data loss in the two systems may be different.
  4. The data may be partitioned or ordered differently.

Lake First



All data is first pushed into the data lake.  Rectangular or relational friendly data is then passed on through to the data warehouse. 
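
A minimal sketch of the lake-first flow, under loud assumptions: the lake and warehouse are plain Python lists, and "rectangular" is taken to mean a flat dict of scalar values. The function names are hypothetical.

```python
SCALARS = (str, int, float, bool, type(None))


def is_rectangular(payload) -> bool:
    """Assumption: rectangular data is a flat dict whose values are all scalars."""
    return isinstance(payload, dict) and all(
        isinstance(value, SCALARS) for value in payload.values()
    )


def land_in_lake(payloads, lake: list) -> list:
    """Step 1: every payload lands in the lake unchanged. Returns the new batch."""
    batch = list(payloads)
    lake.extend(batch)
    return batch


def forward_to_warehouse(batch, warehouse: list) -> None:
    """Step 2: only relational-friendly payloads are passed on to the warehouse."""
    for payload in batch:
        if is_rectangular(payload):
            warehouse.append(payload)
```

The warehouse only ever receives what has already landed in the lake, which is why the lake ends up as a superset of the warehouse in this pattern.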

Advantages

  1. Conformed and Curated zones are built in their own repository with repository-friendly tools.
  2. All data lands in the lake prior to being copied to another location. This means the lake holds a superset of the data found in any downstream system.
  3. Binary data can be peeled off early in the pipeline.
  4. Conversion from streaming or other formats into raw zone format only happens once.
  5. ML and other Big Data tools can see all the data, early.
  6. Data is safe at rest once it is written to the lake.

Disadvantages

  1. The Data Lake is always the master copy and a superset of the Warehouse.
  2. Data appears in the Warehouse after two delays: the lake load, then the extract and warehouse load.

Warehouse First

All data is pushed into the Warehouse and then the Data Lake. Binary data should probably be stored in the Lake as part of the Warehouse load. Raw data is then passed on through to the Data Lake.
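
A rough sketch of the warehouse-first flow, assuming binary payloads are peeled off to the lake during the warehouse load and everything else is copied to the lake afterwards. The lists standing in for the stores, the function names, and the `"direct"` and `"from_warehouse"` path labels are hypothetical.

```python
def warehouse_first_load(payloads, warehouse: list, lake: list) -> None:
    """Load the warehouse; binary payloads go straight to the lake instead."""
    for payload in payloads:
        if isinstance(payload, (bytes, bytearray)):
            # Path 1 into the lake: binary peeled off during the warehouse load.
            lake.append(("direct", payload))
        else:
            warehouse.append(payload)


def copy_warehouse_to_lake(warehouse: list, lake: list) -> None:
    """Path 2 into the lake: rows extracted from the warehouse after it loads."""
    for row in warehouse:
        lake.append(("from_warehouse", row))
```

The path label on each lake entry makes the trade-off visible: data reaches the lake by two different routes, so consumers of the lake may need to know which route a given object took.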

Advantages

  1. Conformed and Curated zones are built in their own repository with repository-friendly tools.
  2. All data lands in the warehouse prior to being copied to another location.  
  3. The SQL store is the analytical store, where the data can be queried directly by humans.
  4. Conversion from streaming or other formats into raw zone format only happens once.
  5. Reporting and compliance see the data early.
  6. Data is safe as soon as it is written to the Warehouse.

Disadvantages

  1. The content of the Data Lake only partially overlaps the data in the Warehouse; neither is a clean superset of the other.
  2. Data appears in the Lake after two delays: the warehouse load, then the extract and lake load.
  3. Binary data should be peeled off early, resulting in two different ways data appears in the Data Lake.

Wrap

These are possibilities. There are other patterns with their own strengths and weaknesses. You may wish to pick a single pattern, or a couple of patterns that are then applied based on the data type and consumer needs.

Created 2021 06 15
