Monday, October 3, 2016

Success and Failure in the world of Service Calls and Messages


Victory has 100 fathers. No-one wants to recognize failure. (The Italian Job)
Developers and designers spend a fair amount of time creating API and service contracts.  The primary focus tends to be happy-path API invocation, with a lot of discussion about the parameters and the data that is returned.  Teams seem to spend little time on the multitude of failure paths, often avoiding any type of failure analysis and planning.

At the API level: Some groups standardize on void methods that throw Exceptions to declare errors. The list of possible exceptions and their causes is not documented.  Exceptions contain messages but no structured, machine-readable data.
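Here is a minimal sketch of what a structured exception could look like. The class name `ServiceError`, its fields and the `reserve_seat` example are all made up for illustration; the point is only that the caller gets a stable code, an owner and a retry hint instead of a bare message.

```python
class ServiceError(Exception):
    """Hypothetical exception that carries machine-readable failure data."""

    def __init__(self, code, owner, retryable, message="", details=None):
        super().__init__(message)
        self.code = code            # stable identifier callers can branch on
        self.owner = owner          # who triages it: "business" or "technical"
        self.retryable = retryable  # is it safe for the caller to try again?
        self.message = message      # human-readable text for logs and UIs
        self.details = details or {}

    def to_dict(self):
        """Return the failure as plain data, ready to log or serialize."""
        return {
            "code": self.code,
            "owner": self.owner,
            "retryable": self.retryable,
            "message": self.message,
            "details": self.details,
        }


def reserve_seat(seat_id):
    # Illustrative operation that always fails a business rule.
    raise ServiceError(
        code="SEAT_TAKEN",
        owner="business",
        retryable=False,
        message=f"Seat {seat_id} is already reserved",
        details={"seat_id": seat_id},
    )


try:
    reserve_seat("12A")
except ServiceError as err:
    payload = err.to_dict()
```

Documenting the list of `code` values alongside the API is what turns this from a convention into a contract.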

At the REST web service level: Some groups standardize on service return values other than 200 for errors. They do not document all of the HTTP return codes they can generate or their meanings.  API consumers are left to their own devices to figure out who owns failures and how they should be handled.

Note:  I get a special kind of heartburn when discussing APIs with folks who have no concept of partial success or success with soft errors.  Yeah, CRUD means you commit or you don't.  More complex applications have more subtle nuances.

Infrastructure Integration

We can handle failure on a per-service basis or try to come up with some type of standard.  A standard is generally a good idea because it makes it possible to plug behavior into infrastructure components without them having to know any details of the invocation or business process.
  • Can you build a system that only looks at the envelope (headers, return codes)?
  • Does it work for impartial systems like service buses or API routers?
  • Can we create a standard that lets us get statistics from cloud components like Load Balancers?
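As a rough sketch of the envelope-only idea, the function below classifies a response using nothing but the HTTP status code and one header. The `X-Outcome` header and the bucket names are assumptions I invented for this example, not an existing standard, and the mapping of 4xx codes to business triage is a simplification.

```python
from collections import Counter
from typing import Mapping

# Status codes that commonly signal a transient condition worth retrying.
RETRYABLE_STATUS = {408, 429, 502, 503, 504}


def classify_envelope(status: int, headers: Mapping[str, str]) -> str:
    """Classify a call outcome from the envelope only, never the body."""
    normalized = {name.lower(): value for name, value in headers.items()}
    # "X-Outcome" is a hypothetical header a service could set on soft errors.
    hint = normalized.get("x-outcome", "").lower()

    if 200 <= status < 300:
        return "partial-success" if hint == "partial" else "success"
    if status in RETRYABLE_STATUS:
        return "retry-ready"
    if 400 <= status < 500:
        return "business-triage"   # assumption: remaining 4xx are rule or data problems
    return "technical-triage"


# A router or load balancer could keep statistics without knowing the business process.
observed = [
    (200, {}),
    (200, {"X-Outcome": "partial"}),
    (503, {}),
    (422, {}),
    (500, {}),
]
stats = Counter(classify_envelope(status, headers) for status, headers in observed)
```

Because the function never reads the payload, the same logic could sit in a service bus, an API router or a log processor.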

Conceptual Baseline

Let's take a broader view of service invocation and define possible end states for API invocation. APIs can end in success, partial success, fatal failure, recoverable failure and possibly other states. Failure can be due to technical issues, bad code, defects, business rules or broken business processes. Failures and errors must be owned by someone, either from the business or by some technology team.
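One way to write those end states down is as an enumeration with an owner attached to each failure state. The names below follow the categories used in the rest of this post; the `EndState` type and the `TRIAGE_OWNER` table are my own illustration, and who owns a partial success is deliberately left open.

```python
from enum import Enum


class EndState(Enum):
    SUCCESS_COMPLETE = "success/complete"
    SUCCESS_PARTIAL = "success/partial"
    BUSINESS_RETRY_READY = "business/retry-ready"
    BUSINESS_TRIAGE = "business/triage"
    TECHNICAL_RETRY_READY = "technical/retry-ready"
    TECHNICAL_TRIAGE = "technical/triage"


# Who owns triage and remediation for each end state.
TRIAGE_OWNER = {
    EndState.SUCCESS_COMPLETE: None,
    EndState.SUCCESS_PARTIAL: None,   # assumption: ownership of soft errors is not decided here
    EndState.BUSINESS_RETRY_READY: "business",
    EndState.BUSINESS_TRIAGE: "business",
    EndState.TECHNICAL_RETRY_READY: "technical",
    EndState.TECHNICAL_TRIAGE: "technical",
}


def unowned_failures():
    """Return failure states nobody owns; this should always be empty."""
    return [
        state
        for state, owner in TRIAGE_OWNER.items()
        if owner is None and not state.value.startswith("success")
    ]
```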

Let's use the following diagram as a starting point. We divide failure based on the owner of the triage and remediation.


Success

Complete | Partial / Soft Errors
DB inserts should probably return a transaction receipt, the key to the updated data or the URL to retrieve the modified data. Success may return confirmation codes, correlation IDs, operation codes, text messages or message parameters. Some of these are used for audit and some are used to build a better user experience.

Partial success seems to give some folks stomach acid problems.  They view success as something absolute. That is true for DB operations. It may not be true for any type of call that can have soft errors, where the back-end service can be partially successful or blocked in some way that doesn't immediately impact the caller.  We'll talk about the meaning of "success" later.
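A success response that has room for soft errors might look something like this. The type names, field names and sample values are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SoftError:
    code: str      # machine-readable identifier for the soft error
    message: str   # human-readable explanation


@dataclass
class SuccessResult:
    receipt_id: str                      # transaction receipt or key of the updated data
    correlation_id: str                  # ties the call to logs and audit records
    resource_url: Optional[str] = None   # where to retrieve the modified data
    soft_errors: List[SoftError] = field(default_factory=list)

    @property
    def is_partial(self) -> bool:
        """True when the call succeeded but something was deferred or degraded."""
        return bool(self.soft_errors)


complete = SuccessResult(receipt_id="r-1001", correlation_id="c-42")
partial = SuccessResult(
    receipt_id="r-1002",
    correlation_id="c-43",
    soft_errors=[SoftError("NOTIFY_DEFERRED", "Confirmation email queued for later delivery")],
)
```

The caller still treats both results as success. The difference is that `partial` gives the UI and the audit trail something to report.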


Business Failure

Business users own the rules, triage and repair of business failures.  Some business failures are common and may be handled through normal application behavior or some type of rework or manual business process. In other cases they may rely on the technical team to acquire failure details and to apply a fix.  The best case is that business users can fix, restart AND terminate the business processes themselves.

Business rules are a normal part of computer programs.  Business rules may terminate an operational request in response to a business rule failure.  This is not a Business Failure for the purposes of this discussion if it is an expected type of behavior.

Business processes can terminate because of a business rule that may actually pass at some time in the future.  This could be due to an asynchronous workflow, some type of failed request, behavior related to soft edits or something else. Business processes can terminate because of unexpected data or state.  They may fail deliberately due to policy rules.

Retry Ready | Business Triage
Retry Ready. Business Failures and Technical Failures may be indistinguishable from each other. Some business failures can be retried immediately or retried some time later. There is a whole range of reasons a business transaction can fail. Some business processes fail, routing work to application error handlers or some type of manual rework queue. There may or may not be tools to fix data found in Business Triage when the business failure was unanticipated.


Technical Failure

IT users own the triage and fixing of technical failures.  Systems and monitoring need to be built assuming there will be failures. These types of failures are often due to design defects, dependency problems, infrastructure issues or version mismatches. IT should detect and automatically fix errors before business users know there is an issue.

Poison data failures may be either business or technical in nature.  Some technical triage may be required to determine the owner of service call failures that fail on retry.

Retry Ready | Technical Triage
Processes can fail technically because of network issues, remote system problems, resource constraints, asynchronous timing issues or other reasons.  Systems should build in automated retry that handles the bulk of these situations. Remote system exceptions, network connectivity problems, unknown hosts, poison messages and failed retries can force triage by the development or technical support teams.  Teams need to build in the logging and message capture needed to determine the causes of technical failures.
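Here is a small sketch of automated retry that escalates to triage when the retries run out. `TransientError`, `TriageRequired` and `call_with_retry` are names I made up; the backoff numbers are placeholders, and a real system would decide for itself which exceptions count as transient.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


class TriageRequired(Exception):
    """Raised when retries are exhausted; carries what was captured per attempt."""

    def __init__(self, attempts):
        super().__init__(f"gave up after {len(attempts)} attempts")
        self.attempts = attempts


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation, retrying transient failures with exponential backoff."""
    attempts = []
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as err:
            # Capture enough detail for whoever triages this later.
            attempts.append({"attempt": attempt, "error": str(err)})
            if attempt == max_attempts:
                raise TriageRequired(attempts) from err
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_lookup():
    # Fails twice, then succeeds, to imitate a brief network problem.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("connection reset")
    return "inventory: 4 units"


result = call_with_retry(flaky_lookup, base_delay=0.0)
```

Anything that is not a `TransientError` passes straight through, which is where poison messages and code defects should surface.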

Retry and Triage queues

Asynchronous messaging makes a great resource for implementing automated retries and Technical and Business rework queues.
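The routing decision can be sketched with three queues. The in-memory `deque` objects below stand in for real broker queues, and the names `route_failure`, `MAX_ATTEMPTS` and the queue names are assumptions for this example.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # assumption: total deliveries allowed before a message goes to triage


@dataclass
class Message:
    body: str
    attempts: int = 0


# In-memory stand-ins for broker queues.
retry_queue = deque()
technical_triage = deque()
business_rework = deque()


def route_failure(message, owner, retry_ready):
    """Send a failed message to the retry queue or to its owner's triage queue."""
    message.attempts += 1
    if retry_ready and message.attempts < MAX_ATTEMPTS:
        retry_queue.append(message)
        return "retry"
    if owner == "business":
        business_rework.append(message)
        return "business_rework"
    technical_triage.append(message)
    return "technical_triage"


order = Message("order-17")
routes = [route_failure(order, "technical", retry_ready=True) for _ in range(3)]

refund = Message("refund-9")
refund_route = route_failure(refund, "business", retry_ready=False)
```

The message itself carries the attempt count, so any consumer can make the same decision without shared state.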

Two Computers Meet in a Bar

  • How do they know if they are successfully communicating?
  • What tells them if their conversation is succeeding?
  • Can they be partially successful?
  • Who owns failure?
  • Who owns a processing failure?

Are We Successful?

  • When another system receives the message?
  • When another system accepts the message?
  • When another system returns success?
  • When some other part of the processing is delayed?

Did We Fail?

  • When a business rule finds an error?
  • When there are soft errors?
  • When a 3rd party is down and we use default behavior?
  • When the receiver has to retry?

Plan Ahead

Owning failures is painful.  I've worked with teams where no one owned failure. Production issues were discovered by users and handled via email.  Everything was reactive. We just sort of planned as if our system could never have problems.  This was a ridiculous approach.
  • Identify Success Scenarios including partial success
  • Identify Business Failure Scenarios and whether they can be remediated by people or systems
  • Identify Technical Failure Scenarios and whether they can be remediated by people or systems

  • Build monitoring, statistics and remediation into the first release
Created 2017 Oct 3
Last Edited 2017 Oct 4
Added AutoHandle 2017 Oct 5
