Failure Mode Analysis - Step One - throwing down failures.

We can make a system better if we measure or analyze it. Let's analyze a small program to determine how it might fail and what we can do about it.

We will break down a software program into smaller modules and look at how each phase or component might fail. We will also look for silent failures, where nothing reports an error but something didn't occur at a time when there should have been some activity, and for places where success metrics are missing.
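A silent failure can be caught by checking how long it has been since the last successful activity. The following is a minimal sketch, not part of the original analysis. The function name `is_silent_failure` and the `max_quiet` threshold are hypothetical, and it assumes each component records a timestamp for its last success.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_silent_failure(last_activity: Optional[datetime],
                      now: datetime,
                      max_quiet: timedelta) -> bool:
    """Return True when no activity was seen within the allowed quiet period."""
    if last_activity is None:
        # Assumption: a component that has never reported counts as silent.
        return True
    return now - last_activity > max_quiet
```

A scheduled job could call this for each component and raise an alert when it returns True. The right quiet period depends on how often the stream is expected to carry data.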

Sample System Under Analysis

Our example system is a data lake sink that:
  1. Reads streaming data.
  2. Validates the data.
  3. Bundles the data into micro-batch sets.
  4. Writes the data to a data lake.
  5. Pushes metrics after each lake write to update the statistics and other features in our metrics store.
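The steps above can be sketched as a small pipeline, which makes it easier to see where each component boundary sits and where a fault could enter. This is an illustrative sketch only, not the actual system. The names `is_valid`, `micro_batches`, `run_sink`, `write_batch`, and `push_metrics` are hypothetical, and the validation rule is an assumption.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def is_valid(record: Record) -> bool:
    # Hypothetical rule: a record must carry a non-null "id" field.
    return record.get("id") is not None


def micro_batches(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group records into lists of at most `size` items."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Flush the final partial batch.
        yield batch


def run_sink(stream: Iterable[Record],
             write_batch: Callable[[List[Record]], None],
             push_metrics: Callable[[Dict[str, int]], None],
             batch_size: int = 100) -> None:
    """Read, validate, batch, write, then push metrics for each write."""
    valid = (record for record in stream if is_valid(record))
    for batch in micro_batches(valid, batch_size):
        write_batch(batch)                               # step 4: lake write
        push_metrics({"records_written": len(batch)})    # step 5: metrics push
```

Each call in `run_sink` is a candidate row on the worksheet: the read can stall, validation can reject everything, the write can fail, and the metrics push can fail even when the write succeeds.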

Video Walkthrough

In this video, we throw down as many failures as we can think of. We can worry about detection and remediation in a later phase.



Worksheet Template

We will record the identified failure modes using a worksheet like this one.
Component | Fault | Severity | Likely | Detection | Remediation | Tech or Business

  • Component: A subsystem or module that the possible failure can be tied to.
  • Fault: A specific failure. We define these as specifically as possible.
  • Detection: How we would detect this problem. Detection can be automated or manual as part of some regular process. The worst situation is when the detection method is "angry customers call us."
  • Remediation: A manual or automated process that can be used to fix the problem or park it for later work.
  • Severity: The risk or damage that occurs with this type of failure.
  • Likelihood: How frequently we expect this problem to occur.
Severity and Likelihood determine the order in which we create remediation processes.
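One way to keep the worksheet in code is a small record type with a sort that ranks rows by Severity and Likelihood. This is a sketch under stated assumptions. The 1 to 5 scales and the severity times likelihood score are assumptions made for illustration, and the names `FailureMode` and `by_priority` are hypothetical. The worksheet itself does not prescribe a scoring formula.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    component: str
    fault: str
    severity: int            # assumed scale: 1 (minor) to 5 (critical)
    likelihood: int          # assumed scale: 1 (rare) to 5 (frequent)
    detection: str = ""
    remediation: str = ""
    tech_or_business: str = ""

    @property
    def priority(self) -> int:
        # Assumed scoring: higher product means work on it sooner.
        return self.severity * self.likelihood


def by_priority(rows: List[FailureMode]) -> List[FailureMode]:
    """Return the rows ordered from highest to lowest priority."""
    return sorted(rows, key=lambda row: row.priority, reverse=True)
```

Detection and Remediation default to empty strings so faults can be captured quickly in the first pass and filled in later.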

Streaming Sink Worksheet

These are the faults identified during our 10-minute session. The analysis moved from left to right through the components in the diagram.

We primarily went after the Faults in this session. We made only half-hearted attempts at Remediation and Detection.

You can see a couple of Ingestion lines in the middle of the sheet. That is because we discovered a couple of additional ingestion problems while looking at something else. We can always re-order the fault list. Don't stop to be orderly in the first pass. Just capture everything that comes to mind.

Missing functionality

Identified missing modules

We identified that the diagram above is missing a metrics collection box at the beginning of the flow. We want to capture metrics around the number of messages received to match up against the metrics bound to the data lake writer.
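Matching the two sets of metrics comes down to a simple reconciliation. The sketch below is hypothetical. The function name `unaccounted_messages` is invented for illustration, and it assumes a third counter for messages rejected by validation, which the worksheet does not mention.

```python
def unaccounted_messages(received: int, rejected: int, written: int) -> int:
    """Messages that entered the sink but were neither rejected nor written.

    A non-zero result over a closed time window suggests silent data loss
    (positive) or duplicate writes (negative).
    """
    return received - rejected - written
```

In a streaming system some messages are always in flight, so a comparison like this is most meaningful over a window that has fully drained.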

Related 

Blog Articles

  • Throwing down failures http://joe.blog.freemansoft.com/2021/01/failure-mode-analysis-ste-one-throwing.html
  • Detection and remediation https://joe.blog.freemansoft.com/2021/02/failure-mode-analysis-step-two.html

Videos

  • Step 1: Throwing down failures https://youtu.be/RdmfSiqsXAs
  • Step 2: Detection and Remediation https://youtu.be/98puya9szWc

Created 2021/01/31
