Plan ahead for internal correlation and tracing needs
Semi-persistent Tracing and Correlation
Software platforms and libraries have plenty of capabilities for recording
logs and metrics. There are many base libraries that can generate
transaction or request identifiers that can be persisted into operational
dashboards. These identifiers can be traced from the ingestion
endpoint through the system. However, teams rarely persist that information
into their data stores as metadata. Without it, there is no way to correlate
data in the database with the originating request, so this lineage is
usually lost.
Operational observability stores should not contain Personally Identifiable
Information (PII). The data needs to be detailed enough to support
dashboards and triage without risking data breaches.
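One way to keep PII out of an observability store is to copy only allowlisted fields into the record that gets logged. The sketch below is a minimal illustration; the field names (`trace_id`, `span_id`, `command`, `status`) and the sample event are assumptions, not a prescribed schema.

```python
# Hypothetical allowlist: only fields considered safe for dashboards and triage.
OBSERVABILITY_FIELDS = ("trace_id", "span_id", "command", "status")


def to_observability_record(event: dict) -> dict:
    """Copy only allowlisted, non-PII fields out of a processed event."""
    return {name: event[name] for name in OBSERVABILITY_FIELDS if name in event}


# Made-up sample event for illustration.
event = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
    "command": "UpdateCustomer",
    "status": "ok",
    "email": "ada@example.com",  # PII: must not reach the observability store
}
record = to_observability_record(event)
```

An allowlist is used here rather than a blocklist so that a newly added payload field stays out of the logs until someone deliberately admits it.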
This discussion applies equally to synchronous systems and to asynchronous
or event-driven systems. The internal structure of the two models is
logically identical.
Capturing data lineage from the front door to the data store
End-to-end tracing and tracking is essentially a form of
lineage. Our tracing and lineage scope starts where requests are
received and ends with data persisted in a data store. Teams often
have practices around capturing lastModifiedBy and lastModifiedAt.
Lineage tracing should be added alongside them in the form of one or
more tracing identifiers, such as lastSpanId or lastTraceId. This makes
it possible to
correlate log activity against the data that ended up in the
database.
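As a rough sketch of the idea, the example below stores lastTraceId and lastSpanId next to lastModifiedBy and lastModifiedAt in an in-memory SQLite table. The `customer` table, the `save_customer` helper, and the UUID-derived identifier format are all assumptions made for illustration; a real system would take the identifiers from whatever tracing library it already uses.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def open_store() -> sqlite3.Connection:
    """Create an in-memory table that keeps lineage fields next to audit fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer (
            id             INTEGER PRIMARY KEY,
            name           TEXT NOT NULL,
            lastModifiedBy TEXT NOT NULL,
            lastModifiedAt TEXT NOT NULL,
            lastTraceId    TEXT NOT NULL,
            lastSpanId     TEXT NOT NULL
        )
        """
    )
    return conn


def save_customer(conn, customer_id, name, modified_by, trace_id, span_id):
    """Write the row and stamp it with the identifiers of the request that changed it."""
    conn.execute(
        "INSERT OR REPLACE INTO customer VALUES (?, ?, ?, ?, ?, ?)",
        (customer_id, name, modified_by,
         datetime.now(timezone.utc).isoformat(), trace_id, span_id),
    )
    conn.commit()


conn = open_store()
trace_id = uuid.uuid4().hex      # assumed identifier format: 32 hex characters
span_id = uuid.uuid4().hex[:16]  # assumed identifier format: 16 hex characters
save_customer(conn, 1, "Ada", "orders-service", trace_id, span_id)
```

Given a trace identifier found in the logs, a query on lastTraceId then returns the rows that request last touched.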
We treat inbound data like an event stream whether it arrives as an actual
event stream or through a synchronous API. Both have input payloads and some
type of implicit or explicit command. We store those events and their metadata
in a PII-secure location. This lets us trace from an input payload
through processing and then into the database.
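The sketch below shows one way to normalize both kinds of input into the same stored record. Everything in it is hypothetical: the `InboundEvent` shape, the rule that an API call's command is its method and path, the assumption that stream messages carry `type` and `data` keys, and the in-memory list standing in for a PII-secure store.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class InboundEvent:
    trace_id: str
    source: str       # "api" or "stream"
    command: str      # explicit, or derived when the input only implies it
    payload: dict
    received_at: str


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


def from_api_request(method: str, path: str, body: dict) -> InboundEvent:
    """A synchronous call implies its command through the method and path."""
    return InboundEvent(uuid.uuid4().hex, "api", f"{method} {path}", body, _now())


def from_stream_message(message: dict) -> InboundEvent:
    """A stream message is assumed to name its command explicitly under 'type'."""
    return InboundEvent(uuid.uuid4().hex, "stream", message["type"], message["data"], _now())


class InboundEventStore:
    """In-memory stand-in for a PII-secure event store."""

    def __init__(self) -> None:
        self._events = []

    def append(self, event: InboundEvent) -> None:
        self._events.append(event)

    def find(self, trace_id: str) -> list:
        return [e for e in self._events if e.trace_id == trace_id]


store = InboundEventStore()
api_event = from_api_request("PUT", "/customers/1", {"name": "Ada"})
stream_event = from_stream_message(
    {"type": "CustomerRenamed", "data": {"id": 1, "name": "Ada"}}
)
store.append(api_event)
store.append(stream_event)
```

After normalization, downstream processing only sees an `InboundEvent`, so the same lineage handling applies regardless of how the request arrived.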
Inter-Domain Notifications and Lineage
Outbound API calls and event notifications should include tracing and
lineage identifiers that let us track information as it flows between
systems and beyond. Outbound notifications should be captured and
stored with their lineage and tracking metadata. This lets us
correlate the inbound requests, the log traffic, the data in our database,
and what we told others about the change.
We store information that helps us with replay, long-term triage, lineage
tracing, and regulatory support. In this case, we store the inbound events,
the data in our database, and any outbound events that were created as a
result of processing the inbound request. All of that data should have
attached metadata that lets us correlate across the lifetime of a processing
request.
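A minimal sketch of that correlation follows, assuming one trace identifier is minted per inbound request and stamped on all three artifacts. The in-memory lists and dict stand in for separate stores, and the `handle` and `lineage` helpers, the `traceId` key, and the `CustomerChanged` event type are hypothetical names.

```python
import uuid

# In-memory stand-ins for three separate stores.
inbound_events = []
customers = {}
outbound_events = []


def handle(command: str, payload: dict) -> str:
    """Process one inbound request and stamp every stored artifact with its trace id."""
    trace_id = uuid.uuid4().hex
    inbound_events.append({"traceId": trace_id, "command": command, "payload": payload})
    customers[payload["id"]] = {"name": payload["name"], "lastTraceId": trace_id}
    outbound_events.append(
        {"traceId": trace_id, "type": "CustomerChanged", "data": {"id": payload["id"]}}
    )
    return trace_id


def lineage(trace_id: str) -> dict:
    """Collect everything stored for one processing request."""
    return {
        "inbound": [e for e in inbound_events if e["traceId"] == trace_id],
        "data": {k: v for k, v in customers.items() if v["lastTraceId"] == trace_id},
        "outbound": [e for e in outbound_events if e["traceId"] == trace_id],
    }


first = handle("RenameCustomer", {"id": 1, "name": "Ada"})
second = handle("RenameCustomer", {"id": 2, "name": "Grace"})
report = lineage(first)
```

Here `report` holds the inbound event, the stored row, and the outbound notification for the first request only, which is the view needed for replay or long-term triage.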
Our Internal Bounded Lineage/Tracing Context
We can treat our system boundaries as the context within which we create and
store correlation and tracing identifiers. Those identifiers meet our
operational needs and provide the lineage of data and work moving through
our system. The lineage/tracing metadata is attached to everything we
deliberately persist and everything we share with others.
Tracing across contexts and inside our own context
Implementing balance and controls and tracking lineage across systems and
inside a system is often more difficult in today's distributed microservice
systems than it was in the monolithic past.
Systems must implement tracking and correlation identifiers at the local
level to make it possible to triage activities and bind events with
persisted data in analytical use cases. Retaining transaction/lineage
metadata in the analytical store significantly simplifies query
correlation.
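To illustrate why retained lineage metadata simplifies analytical queries, the sketch below joins persisted data to its originating and resulting events on the trace identifier. The table names, column names, and sample rows are made up for the example, and SQLite stands in for whatever analytical store is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE inbound_event  (traceId TEXT, command TEXT, receivedAt TEXT);
    CREATE TABLE customer       (id INTEGER, name TEXT, lastTraceId TEXT);
    CREATE TABLE outbound_event (traceId TEXT, type TEXT);

    -- Made-up sample rows for illustration.
    INSERT INTO inbound_event  VALUES ('t-1', 'RenameCustomer', '2022-10-01T12:00:00Z');
    INSERT INTO inbound_event  VALUES ('t-2', 'RenameCustomer', '2022-10-01T12:05:00Z');
    INSERT INTO customer       VALUES (1, 'Ada',   't-1');
    INSERT INTO customer       VALUES (2, 'Grace', 't-2');
    INSERT INTO outbound_event VALUES ('t-1', 'CustomerChanged');
    INSERT INTO outbound_event VALUES ('t-2', 'CustomerChanged');
    """
)

# One join answers: which request produced this row, and what did we tell others?
LINEAGE_QUERY = """
    SELECT c.id, c.name, i.command, i.receivedAt, o.type
    FROM customer AS c
    JOIN inbound_event AS i ON i.traceId = c.lastTraceId
    LEFT JOIN outbound_event AS o ON o.traceId = c.lastTraceId
    WHERE c.id = ?
"""

rows = conn.execute(LINEAGE_QUERY, (1,)).fetchall()
```

Without the lastTraceId column, the same question would have to be answered by matching timestamps or payload contents across stores, which is slower and less reliable.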
Enterprises should mandate cross-system correlation through a non-functional
requirement (NFR). They should then require all programs to implement
tracking and correlation identifiers for all intersystem traffic. Those
identifiers should be exposed in all logs and in all observability platforms
and tools. This is the only reasonable way of validating that a request was
received and that work was processed across the various systems.
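The sketch below shows one possible way to meet such a requirement inside a single service using only the Python standard library: accept or mint a correlation identifier, expose it on every log line, and forward it on outbound calls. The header name `X-Correlation-Id`, the logger name, and the helper functions are assumptions for illustration, not an established standard for any particular platform.

```python
import contextvars
import io
import logging
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # hypothetical header name

# Holds the identifier for the request currently being processed.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("lineage-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def accept_request(headers: dict) -> str:
    """Reuse the caller's identifier when present, otherwise mint a new one."""
    trace_id = headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


def outbound_headers() -> dict:
    """Headers to attach to any call this service makes to another system."""
    return {CORRELATION_HEADER: trace_id_var.get()}


buffer = io.StringIO()
log = build_logger(buffer)
accepted = accept_request({CORRELATION_HEADER: "abc123"})
log.info("order received")
forwarded = outbound_headers()
```

Because the identifier arrives from the caller and leaves with every outbound call, the same value appears in the logs of each system that took part in the request.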
Created 2022 10