What do you know and why do you know it - Lineage for ML

December 22, 2019

Repeatability and accountability of processes and decisions are requirements for large enterprises and for entities operating in regulated industries. Two areas are pushing organizations to improve the way they track data movement, transformation and usage: the justification and auditability of Machine Learning decisions, and the tracking of privacy-related data. This drives the need for Data Lineage tracking and reporting. Organizations have to trade off the ease of creating and capturing data lineage, the amount of data captured, and the ease of reporting and auditing. Data lineage includes the data's origin, what happens to it and where it moves over time.[1] Data lineage information includes technical metadata involving data transformations.[2]

This diagram shows a simple data movement where data originates in one system, is transformed, stored in a database, then transformed again and used by a machine model. The resulting calculation is then stored again. The small circles call out what needs to be captured for lineage.

The original data source
The retrieval from the original data source, the transformation including version and the storage to the RDB
The retrieval from the RDB, its transformation including version
The sending of that data to the machine model
The version of the machine model
The movement of data into the sink.
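
As a rough illustration, the six capture points above could each be recorded as a small structured event. This is only a sketch: the `LineageEvent` type, its field names, and the system names and version numbers (`orders_api`, `rdb.orders`, `risk_model` and so on) are all made up for the example and do not come from any particular lineage product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageEvent:
    """One lineage capture point (all field names are illustrative)."""
    step: int                         # position of the capture point in the flow
    action: str                       # "origin", "transform", "send", "score" or "store"
    source: Optional[str] = None      # where the data was read from
    target: Optional[str] = None      # where the data was written or sent
    component: Optional[str] = None   # transform or model that handled the data
    version: Optional[str] = None     # version of that component


# Hypothetical system names and versions for the simple flow.
SIMPLE_FLOW = [
    LineageEvent(1, "origin", source="orders_api"),
    LineageEvent(2, "transform", source="orders_api", target="rdb.orders",
                 component="ingest_transform", version="1.4.0"),
    LineageEvent(3, "transform", source="rdb.orders",
                 component="feature_transform", version="2.1.0"),
    LineageEvent(4, "send", source="feature_transform", target="risk_model"),
    LineageEvent(5, "score", component="risk_model", version="0.9.3"),
    LineageEvent(6, "store", source="risk_model", target="scores_sink"),
]


def versions_in_play(events):
    """Return {component: version} for every versioned step in a flow."""
    return {e.component: e.version for e in events if e.version is not None}


print(versions_in_play(SIMPLE_FLOW))
```

Keeping the version next to each transform and model is what later lets an auditor say which code produced a given result.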

Slightly more complicated flow

The next diagram shows a fairly common flow with multiple Feature transforms and a Machine Model. Note the potential for many lineage telemetry capture points. Real systems and models can easily have dozens of data movement, storage and transformation lineage touchpoints.

Computer scientists are always trading off ease of development and maintenance vs performance and data usage.

Static Lineage Capture

Some organizations take a deploy-time approach to lineage in order to retrofit existing systems and processes. This approach registers data source/transformation/sink triplets in a lineage registry whenever new transformations are deployed or data sources/sinks are changed. Data forensics teams can use the registry to figure out which transformations were in play at any given time.

Transformation registrations are updated with every deployment, documenting the software version and its location in an immutable repository. The following diagram shows a system where all the lineage is captured via four transformation registrations. This lets us determine the transformations in effect as of a given time by querying the Registry with the appropriate date ranges.
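
A deploy-time registry of this kind might look something like the sketch below. It is an in-memory toy, not a real registry: the `Registration` and `LineageRegistry` names, the `repo://` URIs and the rule that a new deployment retires the previous registration of the same transformation are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Registration:
    """A source/transformation/sink triplet recorded at deploy time."""
    source: str
    transformation: str
    sink: str
    version: str
    artifact_uri: str                        # location in the immutable repository
    deployed_at: datetime
    retired_at: Optional[datetime] = None    # None means still in play


class LineageRegistry:
    def __init__(self) -> None:
        self._rows: List[Registration] = []

    def register(self, reg: Registration) -> None:
        # Assumption: a new deployment supersedes the active registration
        # of the same transformation, so close that one out first.
        for row in self._rows:
            if row.transformation == reg.transformation and row.retired_at is None:
                row.retired_at = reg.deployed_at
        self._rows.append(reg)

    def as_of(self, when: datetime) -> List[Registration]:
        """Return the registrations that were in play at the given time."""
        return [
            r for r in self._rows
            if r.deployed_at <= when and (r.retired_at is None or when < r.retired_at)
        ]


# Hypothetical deployments of one transformation.
registry = LineageRegistry()
registry.register(Registration("orders_api", "ingest_transform", "rdb.orders",
                               "1.4.0", "repo://ingest/1.4.0", datetime(2019, 1, 10)))
registry.register(Registration("orders_api", "ingest_transform", "rdb.orders",
                               "1.5.0", "repo://ingest/1.5.0", datetime(2019, 6, 1)))

for reg in registry.as_of(datetime(2019, 3, 15)):
    print(reg.transformation, reg.version, reg.artifact_uri)
```

Nothing is written per record here, which is why this approach is cheap to retrofit. The cost is that forensics has to join record timestamps against the registry to reconstruct what happened.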

In-Line Real-Time Lineage Capture

Some systems take a more dynamic approach, capturing all lineage movement and transformation for each record or field in a streaming fashion. This approach buys simpler reporting and tracing at the price of significant data streaming and storage costs.

Organizations often create and attach unique identifiers to every record so that telemetry can be tied back to the original record. Telemetry can be stored inside the data records, near the data records, or in a lineage/telemetry-specific data store.
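
To make the idea concrete, here is a minimal sketch of per-record capture that uses a separate telemetry store. The helper names (`tag`, `traced`, `trace`), the `lineage_id` field and the two sample transformations are invented for the illustration, and a plain Python list stands in for the telemetry store.

```python
import uuid
from typing import Callable, List

# Stand-in for a lineage/telemetry-specific data store.
TELEMETRY: List[dict] = []


def tag(record: dict) -> dict:
    """Attach a unique lineage identifier to a new record."""
    return {**record, "lineage_id": str(uuid.uuid4())}


def traced(name: str, version: str,
           fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a transformation so that each call emits one telemetry event."""
    def wrapper(record: dict) -> dict:
        result = fn(record)
        result["lineage_id"] = record["lineage_id"]   # carry the id forward
        TELEMETRY.append({
            "lineage_id": record["lineage_id"],
            "transformation": name,
            "version": version,
        })
        return result
    return wrapper


def trace(lineage_id: str) -> List[str]:
    """List the transformations applied to one record, in order."""
    return [f'{e["transformation"]}@{e["version"]}'
            for e in TELEMETRY if e["lineage_id"] == lineage_id]


# Two hypothetical transformations applied to one record.
to_cents = traced("to_cents", "1.0.0",
                  lambda r: {"amount_cents": int(r["amount"] * 100)})
add_flag = traced("add_flag", "2.3.1",
                  lambda r: {**r, "large": r["amount_cents"] > 10_000})

record = tag({"amount": 125.5})
scored = add_flag(to_cents(record))
print(trace(record["lineage_id"]))
```

Every transformation call writes one telemetry event per record, which shows where the streaming and storage costs of this approach come from.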

Further discussion on this approach will have to wait for another article.

Joe Freeman's Blog