Failure Mode Analysis - Step Two - Detection and Remediation

We evaluate the possible faults and issues identified in step one to determine how we can detect each failure and how we can remediate it.

For this discussion, we will bucket the failure modes into three types, which helps us determine how they can be detected.

We will categorize failures as technical, design-time, or business failures. We can use the category to determine how we wish to remediate the failures. Some of the business rule failures will be "by policy" and their remediation will be in the business departments. The other failures will be remediated via technical means.

Capturing - Detection and Remediation

We want to fill in the Detection and Remediation columns. You can tune the meanings of these columns to your use case.

For this walkthrough, we sweep across all the faults to determine how each fault would actually be detected and then how we would remediate it, whether permanently, tactically, manually, or transiently.

Detection
Classify how this can be detected by the system or people. Originally this column held only the detection method: we detected the fault through log analysis, some monitoring system, or a person looking at a report. For this exercise we want to record that but also determine the category, so I mixed detection and category here. Maybe the category belongs in its own column.
  • Technical: Technical errors are detected via an exception, a log message, or some other type of error event.
  • Business: A business failure is hopefully based on some business rule, or on a condition the business hadn't thought of. Business failures may be raised by the system or created by rules.
  • Design: In some cases, we find a fault that exists due to a missed requirement or bad design. By design, the system may not be able to detect it. Those cases need a fix: a redesign to detect and/or mitigate the fault.
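
If you keep the worksheet in code rather than a spreadsheet, one way to record the category next to the detection method could look like the sketch below. The names (Category, Fault) and the sample faults are made up for illustration and are not part of the original worksheet.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    TECHNICAL = "technical"  # surfaced by an exception, log message, or error event
    BUSINESS = "business"    # tied to a business rule or an unanticipated condition
    DESIGN = "design"        # missed requirement or bad design


@dataclass
class Fault:
    """One row of the worksheet (hypothetical shape)."""
    description: str
    category: Category
    detection: str  # e.g. "log analysis", "monitoring system", "report review"
    remediation: list[str] = field(default_factory=list)

    def needs_redesign(self) -> bool:
        # Design faults may not be detectable by the system as built.
        return self.category is Category.DESIGN


# Sample rows, invented for this sketch.
faults = [
    Fault("Web service call times out", Category.TECHNICAL, "exception in logs"),
    Fault("Order violates a credit rule", Category.BUSINESS, "rule raised by the system"),
    Fault("Duplicate submissions are not tracked", Category.DESIGN, "not detectable today"),
]

# Group descriptions by category to see where remediation effort will land.
by_category = {c: [f.description for f in faults if f.category is c] for c in Category}
```

Keeping the category as its own field, separate from the detection text, is one way to try out the "own column" idea without losing the original detection method.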


Remediation
We specify the remediation path. This may be multi-step, so a fault may have multiple steps or lines in this column. There may be some transient automated remediation, like a retry, that is followed up with human intervention. The best remediation is one that is completely automated.
  • Automatic: Certain types of failures can be automatically remediated. We may retry a call to a web service or re-send an email. This may fail over to manual remediation.
  • Manual: The human or out-of-process steps that remediate this. This could be a person running a secondary manual correction process, creating a report, or manually pushing data or processes around. It could also be a person pulling some data or component out of the current path so that the process can continue with other data or components.
  • Capture Metrics: A business fault or distributed system fault may be expected. We still may wish to capture metrics to watch the error rate. We may be able to ignore a business error until it becomes more common than our design expected.
  • Redesign: Permanent remediation for this type of fault may require a redesign. That should be added here.
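
The remediation path above can be sketched roughly as follows: retry automatically, fail over to a manual queue, and count attempts and errors so the error rate can be watched. This is only an illustration under stated assumptions. The in-memory metrics counter, the manual_queue list, the remediate and error_rate helpers, and the two sample actions are all hypothetical stand-ins for whatever monitoring and ticketing systems you actually use.

```python
from collections import Counter
from typing import Callable

metrics = Counter()            # hypothetical stand-in for a metrics system
manual_queue: list[str] = []   # hypothetical work queue for human follow-up


def remediate(name: str, action: Callable[[], None], max_attempts: int = 3) -> bool:
    """Retry `action`; if every attempt fails, hand the fault to manual remediation."""
    for _ in range(max_attempts):
        metrics["attempts"] += 1
        try:
            action()
            return True
        except Exception:
            metrics["errors"] += 1
    manual_queue.append(name)  # automatic remediation failed over to manual
    return False


def error_rate() -> float:
    """Fraction of attempts that failed, for comparison with the expected rate."""
    return metrics["errors"] / metrics["attempts"] if metrics["attempts"] else 0.0


# Sample actions, invented for this sketch.
calls = {"n": 0}


def flaky_send() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")


def always_fails() -> None:
    raise ConnectionError("service down")


ok = remediate("re-send email", flaky_send)            # succeeds on the third attempt
failed = remediate("call web service", always_fails)   # ends up in manual_queue
```

Comparing error_rate() against whatever rate the design expected is one simple way to decide when an ignored fault needs attention or a redesign.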

Video Walkthrough


System Under Analysis

This is the simple system under analysis.

Related 

Blog Articles

  • Throwing down failures http://joe.blog.freemansoft.com/2021/01/failure-mode-analysis-ste-one-throwing.html
  • Detection and remediation https://joe.blog.freemansoft.com/2021/02/failure-mode-analysis-step-two.html

Videos

  • Step 1: Throwing down failures https://youtu.be/RdmfSiqsXAs
  • Step 2: Detection and Remediation https://youtu.be/98puya9szWc
Created 02/2021
