Showing posts from June, 2021

Architectural Decision Records

Create ADRs when making significant and impactful decisions, and retain those decisions in a way that ephemeral media like email, video calls, and IM channels cannot. Architectural Decision Records (ADRs) capture architecturally significant decisions or design choices, along with the context and consequences of those decisions. ADRs exist to help people in the future understand why an approach was selected over others. That includes future you, new team members, and future teams that take over responsibility for the systems or processes the ADR affected. The typical ADR consists of:

Problem Statement — the problem that drove the need for the ADR.
Current Status — Draft, approved, declined, etc. Some ADRs will be abandoned for other approaches or changes in direction.
Alternatives — the viable alternatives that were considered, with their pros and cons.
Decision — the chosen solution, including enough information to give direction to those that did not…
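The components above can be laid out as a lightweight template. The section names follow the list; everything else is placeholder text, not a prescribed format:

```
ADR-NNN: <short decision title>

Problem Statement: <the problem that drove the need for this ADR>
Current Status:    <Draft | Approved | Declined | Abandoned>
Alternatives:      <viable options considered, each with pros and cons>
Decision:          <the chosen solution, with enough detail to give
                    direction to people who were not in the room>
```

Keeping the template this small lowers the cost of writing one, which matters more than completeness when you want the habit to stick.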

AWS SageMaker Autopilot enables ML as a commodity

Two parts: ML for the masses, and a covid intake ML demonstration from the 2021 Snowflake Summit.

Accelerating the ML revolution:
- SageMaker Autopilot is moving ML from custom programming to a commodity service
- The end of the need for custom ML platforms
- ML for the masses, with lower investment and startup costs
- Easy access to open data
- Partner data sharing with manageable risk

Snowflake and AWS SageMaker Autopilot: Snowflake and AWS provided a low-code demonstration that merged public health data with intake surveys to create a set of machine-learning-model-based services that could help prioritize covid intake patients based on past patterns. The truly interesting part of the demonstration is that they:
- Restructured and merged data sets inside Snowflake with simple SQL
- Created a full ML environment
- Created and trained a model
- Deployed the model a…
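The restructure-and-merge step is the heart of the pipeline. The demo did it with SQL inside Snowflake; here is a rough stdlib-Python sketch of the same left-join, with all field names invented for illustration:

```python
# Hypothetical regional stats: zip code -> covid test positivity rate.
public_health = {"30301": 0.12, "30302": 0.08}

# Hypothetical intake survey rows, one per patient.
intake_surveys = [
    {"patient_id": 1, "zip_code": "30301", "symptom_score": 7},
    {"patient_id": 2, "zip_code": "30302", "symptom_score": 3},
    {"patient_id": 3, "zip_code": "30301", "symptom_score": 9},
]

# The equivalent of a SQL LEFT JOIN: enrich each intake row with the
# regional stats for its zip code, yielding one rectangular table that
# an AutoML service like Autopilot could train on directly.
training_set = [
    {**row, "positivity_rate": public_health.get(row["zip_code"])}
    for row in intake_surveys
]
```

The point is not the join itself but that a single flat table is what commodity AutoML services consume, so the data-shaping step stays in the warehouse.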

Loading both Cloud Data Lake and Warehouse

Let's map out how data can flow from the originating Operational Store to both the Data Lake and the Data Warehouse. We have to decide if the Data Warehouse and Data Lake are peers, or if one is the gold source and the other is a copy. Internet, web, and connected applications have created a data explosion. Cheap storage and unlimited computing power are empowering new use cases like ML and revolutionizing old ones like CRM and CDP. Hive and Hadoop ushered in the age of big data. Data used to exist in two locations: operational and reporting databases. Now data of all types can be collected into a single multi-petabyte Data Lake without expensive custom hardware. Business requirements and regulatory needs should drive your design. The top diagram shows the originating systems loading the Data Lake and the Data Warehouse in parallel as peers. The second diagram shows the Data Lake as the location for all data, with some of that data replicated…
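The "peers" pattern can be sketched in a few lines: the operational store publishes each record to the lake and the warehouse independently, so neither copy is derived from the other. Paths, table, and record shape below are all invented for illustration:

```python
import json
import os
import sqlite3
import tempfile

record = {"order_id": 42, "amount": 19.99}

# Lake side: append the raw record as a JSON line, standing in for
# cheap object storage that accepts any shape of data.
lake_dir = tempfile.mkdtemp()
lake_path = os.path.join(lake_dir, "orders.jsonl")
with open(lake_path, "a") as f:
    f.write(json.dumps(record) + "\n")

# Warehouse side: load the same record into a relational table,
# standing in for the Data Warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.execute(
    "INSERT INTO orders VALUES (?, ?)",
    (record["order_id"], record["amount"]),
)
```

In the gold-source variant, only the lake write would happen here, and a separate replication job would populate the warehouse from the lake.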

Call Recordings and other Binary Data and Metadata in the Data Lake

Data lakes hold data in any format. This includes structured data, semi-structured text data, documents, and binary data. Organizing that binary data and its metadata can be done in several ways. We're talking about binary data and its associated descriptive metadata. This shows some of the metadata that could be associated with each call recording. The recording itself is highly sensitive because we don't know exactly what was said. The extracted text is also highly sensitive because it is a full-text copy with the same risk. Media and binary files can add up. We could have millions of call recordings and all of their associated metadata; it is a large-data problem. We have to pick the format for the binary, non-rectangular data and its associated metadata. We can use the native formats and links, or embed the binary data inside another format. Here are two of the major options. Bin…
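One common layout for the native-format option is a sidecar: the recording stays in its original binary file, and a JSON file next to it carries the descriptive metadata and links back to the binary by name. The file naming convention and every metadata field below are invented for illustration:

```python
import json
import os
import tempfile

store = tempfile.mkdtemp()

# Stand-in bytes for a real audio recording.
recording_bytes = b"RIFF" + b"\x00" * 16

audio_path = os.path.join(store, "call-0001.wav")
meta_path = audio_path + ".json"  # sidecar convention: same name + .json

with open(audio_path, "wb") as f:
    f.write(recording_bytes)

# Descriptive metadata lives beside the binary, not inside it, so it can
# be scanned and queried without touching the sensitive recording.
metadata = {
    "call_id": "0001",
    "recorded_at": "2021-06-01T10:00:00Z",
    "sensitivity": "high",  # the audio and any transcript are both sensitive
    "audio_file": os.path.basename(audio_path),
}
with open(meta_path, "w") as f:
    json.dump(metadata, f)
```

The alternative is embedding the binary and metadata together in a container format, which trades easy metadata scans for fewer files to track.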

Cloud Data Lake vs Warehouse - fit for purpose

Data Lakes and Data Warehouses each have their own strengths and weaknesses. You may need one or the other depending on your needs. Look at your use cases to determine whether it makes sense to have one, the other, or both. Maybe this can give you more things to think about when choosing one over the other. My general experience has been: Data Lakes tend to be the choice when feeding operational systems and when storing binary data. They are often used for massive data transformations or ML feature creation. Sometimes security concerns and partitions may drive highly sensitive data to protected lakes. Data Warehouses tend to be the choice when humans need big data for reporting, data exploration, and collaborative environments. Use cases that put them in the middle of data flows for operational systems should be evaluated for uptime and latency. Different companies will prioritize differently. I've seen companies that were lake only, companies that had…

Streaming Ecosystems Still Need Extract and Load

Enterprises move from batch to streaming data ingestion in order to make data available in a more near-real-time manner. This does not remove the need for extract and load capabilities. Streaming systems only operate on data that is in the stream right now. There is no data available from outside the retention window or from before the system was implemented. There is a whole other set of lifecycle operations that require some type of bulk operation. Examples include:
- Initial data loads where data was collected prior to, or outside of, streaming processing.
- Original event streams that need to be re-ingested because they were mis-processed or because you wish to extract the data differently.
- Original event streams that are fixed or modified and re-ingested in order to correct errors or add information in the operational store.
- Privacy and retention rules that may require the generation of synthetic events to make data change…
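Two of the bulk operations above, replaying archived events and generating synthetic privacy events, can be sketched together. The event shape, field names, and erasure convention here are all invented; the point is that the output looks like any other stream, so downstream consumers need no special path:

```python
from datetime import datetime, timezone

# Events archived before the stream existed, or past its retention window.
archived_events = [
    {"user_id": 1, "type": "signup", "ts": "2020-01-05T00:00:00+00:00"},
    {"user_id": 2, "type": "signup", "ts": "2020-02-10T00:00:00+00:00"},
]

# User ids whose data a privacy request requires downstream systems to remove.
erasure_requests = {2}

def replay(events, erasures):
    """Yield archived events for bulk load, then synthetic erasure events."""
    for event in events:
        yield event
    for user_id in sorted(erasures):
        # Synthetic event consumers apply exactly like a live one.
        yield {
            "user_id": user_id,
            "type": "erase",
            "ts": datetime.now(timezone.utc).isoformat(),
        }

bulk_load = list(replay(archived_events, erasure_requests))
```

Because the replay emits ordinary events, the same ingestion code handles initial loads, re-ingestion after a fix, and privacy-driven changes.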