Tokenizing Sensitive Information - PII Protection

The only way to protect sensitive information is to remove the sensitive values everywhere they are not absolutely needed. Data designers can remove the fields completely or change the field values so that they are useless in the event of data theft. Data tokenization and data encryption are two possible solutions. Both approaches must be implemented so that they return the same non-PII value for a given PII value every time they are invoked.

We're going to talk about tokenization here. Tokenized field values must be changed in a repeatable way so that the attributes remain useful for joining data in queries or reports. This means every data set with the same value for the same PII field will have the same replacement value. This lets us retain the ability to join across datasets or tables using the sensitive data fields.
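A minimal sketch of why repeatability matters for joins, using a hypothetical in-memory token table for a single PII attribute (the field name, sample values, and `tok_` token format are all illustrative assumptions, not part of any real system):

```python
token_table = {}  # hypothetical token table for one PII attribute, e.g. "email"

def tokenize(value):
    # Repeatable: the same PII value always maps to the same token,
    # so tokenized columns still line up across datasets.
    if value not in token_table:
        token_table[value] = f"tok_{len(token_table):06d}"
    return token_table[value]

# Two datasets tokenized independently through the same tokenizer.
orders = [{"email": tokenize("alice@example.com"), "order_id": 1}]
support = [{"email": tokenize("alice@example.com"), "ticket": "T-9"}]

# The tokenized column still joins across datasets, with no raw PII present.
joined = [(o, s) for o in orders for s in support if o["email"] == s["email"]]
```

Because both datasets received the same token for the same email address, the join succeeds even though neither dataset contains the real value.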

Let's talk about tokenizing sensitive information in data stores.  

Tokenization Algorithm

All PII attributes must be identified so that token tables can be added to the token/value store for those attributes. All values in fields flagged as PII are tokenized. 

We tokenize sensitive information by replacing it with a value that acts as a pointer/key to a value in a secure data store. Each PII field has its own token table. Each PII attribute is given a name, and that attribute name is used as the token table name wherever that PII attribute exists.
  1. Move PII values to a sensitive data store and create a unique token for each saved value.
  2. Replace the PII value with that token everywhere it is used for that PII attribute.
Every value in every PII column is tokenized.
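The two steps above can be sketched as follows. This is a hypothetical in-memory stand-in for the secure token/value store, with one token table per PII attribute; class and method names are assumptions for illustration, and a real store would persist tables securely and restrict detokenization to authorized callers:

```python
import secrets

class TokenStore:
    """Hypothetical secure token/value store: one token table per PII attribute."""

    def __init__(self):
        self._tables = {}  # attribute name -> {pii_value: token}

    def tokenize(self, attribute, value):
        # Step 1: move the PII value into the store and mint a unique token.
        # Repeatable: the same value in the same attribute reuses its token.
        table = self._tables.setdefault(attribute, {})
        if value not in table:
            table[value] = secrets.token_hex(16)
        # Step 2: the caller writes this token everywhere in place of the value.
        return table[value]

    def detokenize(self, attribute, token):
        # Reverse lookup; a production store would index tokens directly
        # and gate this call behind access controls.
        for value, tok in self._tables.get(attribute, {}).items():
            if tok == token:
                return value
        return None
```

Because the token is random rather than derived from the value, a stolen tokenized dataset reveals nothing about the original PII without access to the store itself.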

Tokenization for Ingestion and Applications

PII and other sensitive information must be flagged in the metadata catalog in order to determine which fields need to be tokenized. Organizations should already be doing this so they know how to implement PII and non-PII access controls for sensitive information.

Tokenization is implemented as an API that can be called by batch processes, streaming processes, or applications. This process flow shows an environment where a standard ingestion mechanism tokenizes flagged PII fields. An organization could take a different approach where data is tokenized at the source prior to submission to big data stores.
Tokenization happens prior to lake ingestion. 
Tokenization happens prior to storage in operational stores.
Cross-application APIs may use the tokenized values by contract.
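A sketch of what that ingestion step might look like: the metadata catalog tells us which fields are flagged as PII, and every flagged field in an incoming record is tokenized before the record reaches the lake or operational store. The field names, the `PII_FIELDS` set, and the token format are all hypothetical:

```python
# Hypothetical metadata-catalog lookup: fields of this dataset flagged as PII.
PII_FIELDS = {"ssn", "email"}

token_tables = {}  # one token table per PII attribute

def tokenize(attribute, value):
    # Repeatable per-attribute tokenization, as described above.
    table = token_tables.setdefault(attribute, {})
    if value not in table:
        table[value] = f"{attribute}_tok_{len(table):06d}"
    return table[value]

def tokenize_record(record):
    """Replace flagged PII fields with tokens prior to lake/operational storage."""
    return {k: tokenize(k, v) if k in PII_FIELDS else v
            for k, v in record.items()}

record = {"ssn": "123-45-6789", "email": "alice@example.com", "zip": "02134"}
safe = tokenize_record(record)
```

Non-flagged fields pass through untouched, so downstream consumers see the same schema with tokens in place of raw PII.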


