Failure Modes Effects Analysis - FMEA - Step Two - Detection and Remediation

In this step we evaluate each fault identified in step one to determine how we can detect it and how we can remediate it.

For this discussion, we bucket the failure modes into three categories: technical, design-time, and business. The category helps us determine how a failure can be detected and how we wish to remediate it. Some of the business rule failures will be "by policy," and their remediation belongs to the business departments. The other failures will be remediated via technical means.

Capturing - Detection and Remediation

We want to fill in the Detection and Remediation columns. You can tune the meanings of these columns to your use case. For this walkthrough, we sweep across all the faults to determine how each fault would actually be detected and then how we would remediate it, whether permanently, tactically, manually, or transiently.

Classify how this can be detected by the system or by people. Originally this column held only the detection method: log analysis, some monitoring system, or a person looking at a report. For this exercise we want to do that but also determine the category, so I mixed detection and category here. Maybe the category belongs in its own column.
  • Technical: Technical errors are detected via an exception, a log message, or some other type of error event.
  • Business: A business failure is, hopefully, based on some business rule, or on a condition the business hadn't thought of. Business failures may be raised by the system through rules.
  • Design: In some cases, we find a fault that exists due to a missed requirement or bad design. By design, the system may not be able to detect it. In those cases we need a fix: a redesign to detect and/or mitigate the fault.
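
As a rough sketch of keeping the category in its own column, here is one way a worksheet row could be modeled in Python. The class names, field names, and example faults are all made up for illustration; they are not taken from the actual worksheet.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Category(Enum):
    TECHNICAL = "technical"  # surfaced by an exception, log message, or error event
    BUSINESS = "business"    # raised by a business rule or condition
    DESIGN = "design"        # missed requirement or bad design


@dataclass
class FaultRow:
    """One row of the worksheet. All field names here are hypothetical."""
    fault: str
    category: Category
    detection: Optional[str]  # how the fault is detected; None if it is not detectable today
    remediation: List[str] = field(default_factory=list)

    def needs_redesign(self) -> bool:
        # A design fault with no detection method can only be addressed by a redesign.
        return self.category is Category.DESIGN and self.detection is None


# Hypothetical example rows, one per category.
rows = [
    FaultRow("Web service call times out", Category.TECHNICAL, "exception in application log"),
    FaultRow("Order breaks a rule nobody anticipated", Category.BUSINESS, "rule raises an error"),
    FaultRow("Requirement was never captured", Category.DESIGN, None),
]

# Faults the system cannot see today and that need a redesign.
undetectable = [row.fault for row in rows if row.needs_redesign()]
```

Splitting category and detection into separate fields makes it easy to sweep the rows and pull out the faults that cannot be detected at all.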

We specify the remediation path. This may be multi-step, so a fault may have multiple lines in this column. There may be some transient automated remediation, like a retry, that is followed up with human intervention. The best remediation is one that is completely automated.
  • Automatic: Certain types of failures can be automatically remediated. We may retry a call to a web service or re-send an email. This may fail over to manual remediation.
  • Manual: The human or out-of-process steps that remediate the fault. This could be a person running a secondary manual correction process, creating a report, or manually pushing data or processes around. It could also be a person pulling some data or component out of the current path so that the process can continue with other data or components.
  • Capture Metrics: A business fault or distributed system fault may be expected. We may still wish to capture metrics to watch the error rate. We may be able to ignore a business error until it becomes more common than our design expected.
  • Redesign: Permanent remediation for this type of fault may require a redesign. That should be added here.
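
To make the automatic, manual, and metrics paths a little more concrete, here is a minimal Python sketch of a retry that fails over to manual remediation while counting errors. Everything in it is an assumption for illustration: `remediate`, `error_rate`, the in-memory `metrics` counter, and the `manual_queue` list stand in for whatever retry policy, monitoring system, and ticket queue you actually use.

```python
import time
from collections import Counter

metrics = Counter()   # stand-in for a real metrics system
manual_queue = []     # stand-in for a ticket or work queue handled by people


def remediate(name, action, attempts=3, delay=0.0):
    """Try `action` up to `attempts` times, then hand off to manual remediation."""
    for attempt in range(1, attempts + 1):
        metrics[f"{name}.attempts"] += 1
        try:
            return action()
        except Exception:
            metrics[f"{name}.errors"] += 1
            if attempt < attempts:
                time.sleep(delay)  # transient remediation: wait, then retry
    manual_queue.append(name)      # automation gave up; a person takes over
    return None


def error_rate(name):
    """Fraction of attempts that failed, so we can watch for unexpected growth."""
    attempts = metrics[f"{name}.attempts"]
    return metrics[f"{name}.errors"] / attempts if attempts else 0.0


# Hypothetical operations: one recovers after two failures, one never recovers.
calls = {"n": 0}


def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "sent"


def always_down():
    raise ConnectionError("service unavailable")


result = remediate("send_email", flaky_send)
fallback = remediate("call_service", always_down)
```

In this sketch the email send succeeds on the third attempt, while the service call exhausts its retries and lands in `manual_queue`. The error rate per fault is what you would compare against the rate your design expected.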

Video Walkthrough

System Under Analysis

This is the simple system under analysis.


Blog Articles

  • Step 1: Throwing down failures
  • Step 2: Detection and Remediation
  • Garage Door Failure to Close
Created 2021/02
Title Revised 2024/03

