Plan ahead for internal correlation and tracing needs

October 27, 2022

Inter-system and intra-system tracing capabilities are a must in modern distributed architectures, and in any system where dashboards and triage must work without cracking open the production environment for on-box work. Teams need to understand and verify the lineage of inbound requests that end up in data stores and in outbound calls or notifications to other systems.

Lineage and observability Non-Functional Requirements (NFRs) are what create the need for inter-team and intra-team correlation capabilities. Every team should have tracing and monitoring NFRs that describe its observability needs. Those NFRs should describe how a system must support tracking work from the time it enters the system until it comes to rest in a data store or is communicated to other systems. Teams without these NFRs often end up scrambling to provide production metrics and debugging tools during production events.

Semi-persistent Tracing and Correlation

Software platforms and libraries have plenty of capabilities for recording logs and metrics. Many base libraries can generate transaction or request identifiers that are surfaced in operational dashboards and can be traced from the ingestion endpoint through the system. Teams rarely persist those identifiers into their data stores as metadata. As a result, there is no way to correlate data in the database with the originating request, and the lineage is usually lost.

Operational observability stores should not contain Personally Identifiable Information (PII). The data needs to be rich enough to support dashboards and triage without risking data breaches.
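One way to keep PII out of operational stores is an allow-list applied before anything is logged or sent to a dashboard. The sketch below is only an illustration: the field names in SAFE_FIELDS and the shape of the event are assumptions, not something prescribed here.

```python
# Hypothetical allow-list of fields considered safe for operational stores.
SAFE_FIELDS = {"trace_id", "span_id", "event_type", "status", "duration_ms"}


def to_observability_record(event: dict) -> dict:
    """Keep only allow-listed fields so PII never reaches logs or dashboards."""
    return {key: value for key, value in event.items() if key in SAFE_FIELDS}


# Example event: the email and name are PII and must not be forwarded.
event = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "event_type": "customer.updated",
    "status": "ok",
    "email": "ada@example.com",
    "name": "Ada",
}
record = to_observability_record(event)
```

An allow-list fails closed: a newly added payload field stays out of the observability store until someone decides it is safe.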

The discussion applies to synchronous systems and asynchronous or event-driven systems. The internal structure in those two models is logically identical.

Capturing data lineage from the front door to the data store

End-to-end tracing and tracking is essentially a form of lineage. Our tracing and lineage scope starts where requests are received and ends with data persisted in a data store. Teams often already capture lastModifiedBy and lastModifiedAt. Lineage tracing should be added alongside them in the form of one or more tracing identifiers, something like lastSpanId or lastTraceId. This makes it possible to correlate log activity with the data that ended up in the database.
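A minimal sketch of what this could look like, using SQLite from the Python standard library. The customer table, the save_customer helper, and the way identifiers are generated are all assumptions made for illustration; only the four metadata column names come from the text above.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table that carries audit and lineage columns."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS customer (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            lastModifiedBy TEXT NOT NULL,
            lastModifiedAt TEXT NOT NULL,
            lastTraceId TEXT NOT NULL,
            lastSpanId TEXT NOT NULL
        )
        """
    )


def save_customer(conn, customer_id, name, modified_by, trace_id, span_id):
    """Write the row together with the identifiers of the request that caused it."""
    conn.execute(
        "INSERT OR REPLACE INTO customer VALUES (?, ?, ?, ?, ?, ?)",
        (
            customer_id,
            name,
            modified_by,
            datetime.now(timezone.utc).isoformat(),
            trace_id,
            span_id,
        ),
    )
    conn.commit()


# Usage: the same identifiers that appear in the logs are stored with the row.
conn = sqlite3.connect(":memory:")
create_schema(conn)
trace_id = uuid.uuid4().hex        # 32 hex characters
span_id = uuid.uuid4().hex[:16]    # 16 hex characters
save_customer(conn, "c-1", "Ada", "svc-intake", trace_id, span_id)
```

With the identifiers on the row, a log search for the trace id and a database query for lastTraceId point at the same unit of work.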

We treat inbound data like an event stream, whether it arrives as an actual event stream or through a synchronous API. Both have input payloads and some type of implicit or explicit command. We store those events and their metadata in a PII-secure location. This lets us trace from an input payload through processing and then into the database.
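The idea of treating both entry points the same way can be sketched as a common envelope. Everything named here is hypothetical: the InboundEvent class, the x-trace-id header, the message keys, and the in-memory list standing in for a PII-secure event store.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InboundEvent:
    """One inbound unit of work, regardless of how it arrived."""
    command: str      # explicit (event type) or implicit (HTTP method and path)
    payload: dict
    trace_id: str
    source: str       # "http" or "queue"
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    received_at: float = field(default_factory=time.time)


def from_http_request(method: str, path: str, body: dict, headers: dict) -> InboundEvent:
    """Wrap a synchronous API call; the command is implied by method and path."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    return InboundEvent(f"{method} {path}", body, trace_id, "http")


def from_queue_message(message: dict) -> InboundEvent:
    """Wrap an asynchronous message; the command is the explicit message type."""
    trace_id = message.get("trace_id") or uuid.uuid4().hex
    return InboundEvent(message["type"], message["data"], trace_id, "queue")


# Stand-in for a PII-secure inbound event store.
inbound_store = []
inbound_store.append(
    from_http_request("PUT", "/customers/c-1", {"name": "Ada"}, {"x-trace-id": "t-1"})
)
inbound_store.append(
    from_queue_message({"type": "UpdateCustomer", "data": {"name": "Ada"}, "trace_id": "t-2"})
)
```

Downstream processing then sees one shape and one trace_id field, whichever door the request came through.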

Inter-Domain Notifications and Lineage

Outbound API calls and event notifications should include tracing and lineage identifiers that let us track information as it flows between systems and beyond. Outbound notifications should be captured and stored with their lineage and tracking metadata. This lets us correlate the inbound requests, the log traffic, the data in our database, and what we told others about the change.
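A sketch of an outbound notification that carries its identifiers and is recorded as it is sent. It assumes the W3C Trace Context traceparent format (version, 32-hex trace id, 16-hex parent id, flags) as the cross-system identifier, which is one possible choice rather than something this post mandates. The outbox list, build_notification, and send_and_record are hypothetical names.

```python
import uuid


def build_notification(event_type: str, data: dict, trace_id: str) -> dict:
    """Create an outbound message that carries the trace id of the inbound request."""
    span_id = uuid.uuid4().hex[:16]  # new span for this outbound hop
    return {
        "headers": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "type": event_type,
        "data": data,
    }


# Stand-in for a durable store of everything we told other systems.
outbox = []


def send_and_record(notification: dict, inbound_event_id: str, publish) -> None:
    """Publish the notification and keep a copy linked to its inbound event."""
    publish(notification)
    trace_id = notification["headers"]["traceparent"].split("-")[1]
    outbox.append(
        {
            "inbound_event_id": inbound_event_id,
            "trace_id": trace_id,
            "notification": notification,
        }
    )


# Usage with a list's append standing in for a real publisher.
sent = []
inbound_trace_id = uuid.uuid4().hex
note = build_notification("customer.updated", {"id": "c-1"}, inbound_trace_id)
send_and_record(note, "evt-1", sent.append)
```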

We store information that helps us with replay, long-term triage, lineage tracing, and regulatory support. In this case, we store the inbound events, the data in our database, and any outbound events that were created as a result of processing the inbound request. All of that data should have attached metadata that lets us correlate across the lifetime of a processing request.

Our Internal Bounded Lineage/Tracing Context

We can treat our system boundaries as our system context for creating and storing correlation and tracing identifiers to meet our operational needs and to provide the lineage of data and work moving through our system. The lineage/tracing metadata is attached to everything we deliberately persist and everything we share with others.

There can actually be two sets of correlation ids: those that are useful only inside the context and those that let us coordinate across context boundaries. In this case, context means the system, and cross-context means the points where other systems invoke this one or where this system directly invokes others or sends them notifications.
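The two sets of identifiers can be kept side by side in one small object. This is a sketch under assumptions: the CorrelationContext class and the x-correlation-id header name are hypothetical, and a real system may carry more identifiers than these two.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CorrelationContext:
    # Shared with callers and downstream systems.
    external_correlation_id: str
    # Meaningful only inside this system's boundary.
    internal_trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def log_fields(self) -> dict:
        """Internal logs carry both ids so either one can be searched."""
        return {
            "correlation_id": self.external_correlation_id,
            "trace_id": self.internal_trace_id,
        }

    def outbound_headers(self) -> dict:
        """Only the cross-context id leaves the system."""
        return {"x-correlation-id": self.external_correlation_id}


def start_context(inbound_headers: dict) -> CorrelationContext:
    """Reuse the caller's correlation id when present, otherwise mint one."""
    external = inbound_headers.get("x-correlation-id") or uuid.uuid4().hex
    return CorrelationContext(external_correlation_id=external)


ctx = start_context({"x-correlation-id": "partner-req-42"})
```

Keeping the internal id out of outbound headers lets the system change its internal tracing scheme without breaking agreements with other systems.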

Tracing across contexts and inside our own context

Implementing balance and controls and tracking lineage, both across systems and inside a system, is often more difficult in today's distributed microservice systems than it was in the monolithic past.

Systems must implement tracking and correlation identifiers at the local level to make it possible to triage activities and bind events with persisted data in analytical use cases. Retaining transaction/lineage metadata in the analytical store significantly simplifies query correlation.
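To illustrate how retained lineage metadata simplifies correlation, the sketch below joins inbound events, persisted data, and outbound events on a shared trace id. The table and column names are hypothetical, and an in-memory SQLite database stands in for the analytical store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE inbound_event (event_id TEXT, trace_id TEXT, command TEXT);
    CREATE TABLE customer (id TEXT, name TEXT, lastTraceId TEXT);
    CREATE TABLE outbound_event (event_id TEXT, trace_id TEXT, event_type TEXT);

    INSERT INTO inbound_event VALUES ('in-1', 't-1', 'UpdateCustomer');
    INSERT INTO customer VALUES ('c-1', 'Ada', 't-1');
    INSERT INTO outbound_event VALUES ('out-1', 't-1', 'customer.updated');

    INSERT INTO inbound_event VALUES ('in-2', 't-2', 'UpdateCustomer');
    """
)


def lineage_for_trace(conn: sqlite3.Connection, trace_id: str) -> list:
    """Return what came in, what was stored, and what was sent for one trace."""
    return conn.execute(
        """
        SELECT i.command, c.id, o.event_type
        FROM inbound_event AS i
        LEFT JOIN customer AS c ON c.lastTraceId = i.trace_id
        LEFT JOIN outbound_event AS o ON o.trace_id = i.trace_id
        WHERE i.trace_id = ?
        """,
        (trace_id,),
    ).fetchall()
```

The LEFT JOINs are deliberate: a trace with an inbound event but no stored row or outbound event still shows up, with NULLs marking where the work stopped.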

Enterprises should mandate cross-system correlation via NFR. They should then require all programs to implement tracking and correlation identifiers for all intersystem traffic. Those identifiers should be exposed in all logs and observability platform tools. This is the only reasonable way of validating that a request was received and that work was processed across the various systems.
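Exposing the identifier in every log line can be done once, centrally, rather than in each log call. The sketch below uses Python's standard logging and contextvars modules; the logger name, the log format, and the TraceIdFilter class are assumptions chosen for illustration.

```python
import contextvars
import io
import logging

# Holds the trace id of the request currently being processed.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every record passing through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("orders")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: set the id once at the front door; every later log line carries it.
stream = io.StringIO()
logger = build_logger(stream)
current_trace_id.set("abc123")
logger.info("order accepted")
```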

Created 2022 10

Joe Freeman's Blog