Tokenizing Sensitive Information

Tokenizing Sensitive Information - PII Protection

July 14, 2020

The only way to protect sensitive information is to remove the sensitive values everywhere they are not absolutely needed. Data designers can either remove the fields completely or change the field values so that they are useless if the data is stolen. Data tokenization and data encryption are two possible ways to do this. Both must be implemented so that they return the same non-PII value for a given PII value every time they are invoked.

We're going to talk about tokenization here. Tokenized field values must be changed in a repeatable way so that the attributes can still be used to join data in queries or reports. This means every data set with the same value for the same PII field will have the same replacement value. This lets us retain the ability to join across datasets or tables on sensitive data fields.
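
As a rough illustration of why repeatability matters, the sketch below tokenizes the same email address in two unrelated datasets and then joins them on the token. The in-memory dictionary, the `tokenize_email` helper, and the sample records are all hypothetical stand-ins, not part of any real tokenization product.

```python
import uuid

# Hypothetical in-memory token map for one PII field: PII value -> token.
_email_tokens = {}

def tokenize_email(value):
    """Return the existing token for this value, or mint a new random one."""
    if value not in _email_tokens:
        _email_tokens[value] = uuid.uuid4().hex
    return _email_tokens[value]

orders = [
    {"email": "ann@example.com", "order_id": 1},
    {"email": "bob@example.com", "order_id": 2},
]
tickets = [{"email": "ann@example.com", "ticket_id": 77}]

# Each dataset is tokenized independently of the other.
tok_orders = [{**row, "email": tokenize_email(row["email"])} for row in orders]
tok_tickets = [{**row, "email": tokenize_email(row["email"])} for row in tickets]

# The join still works because equal PII values received equal tokens.
joined = [
    (o["order_id"], t["ticket_id"])
    for o in tok_orders
    for t in tok_tickets
    if o["email"] == t["email"]
]
```

Neither tokenized dataset contains an email address, yet order 1 still lines up with ticket 77.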

Every PII field has a type code, or key. That key is used whenever any value for that field is tokenized or detokenized. This means you have to create a data dictionary containing all the PII fields that can be tokenized. Applications must use the matching key whenever they protect a PII field of that type.
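
One possible shape for such a data dictionary is sketched below. The `PiiType` class, the `require_type` helper, and the three sample type codes are illustrative assumptions; a real dictionary would probably live in a governed catalog instead of in application code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PiiType:
    type_code: str      # key passed on every tokenize/detokenize call
    description: str

# Hypothetical dictionary entries, keyed by type code.
PII_DICTIONARY = {
    t.type_code: t
    for t in (
        PiiType("EMAIL", "Email address"),
        PiiType("SSN", "US Social Security number"),
        PiiType("PHONE", "Phone number"),
    )
}

def require_type(type_code):
    """Return the dictionary entry, rejecting type codes that were never registered."""
    try:
        return PII_DICTIONARY[type_code]
    except KeyError:
        raise ValueError(f"unknown PII type code: {type_code!r}") from None
```

Rejecting unregistered type codes up front keeps applications from inventing their own keys, which would split the same PII values across different token tables.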

Let's talk about tokenizing sensitive information in data stores.

Tokenization Algorithm

All PII attributes must be identified so that token tables can be added to the token/value store for those attributes. All values in fields flagged as PII are tokenized.

We tokenize sensitive information by replacing it with a value that acts as a pointer or key to the original value in a secure data store. Each PII field has its own token table. Each PII attribute is given a name, and that attribute name is used as the token table name wherever that PII attribute exists.

Create a categorization for each identified PII field.
Pass PII values and their field names to the tokenization service.
The tokenizer saves the PII in its data store.
The tokenizer generates a token or returns an existing token if that value already was saved for that field.
The requesting process replaces the PII value with that token and persists it to the local data store if required.
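
The steps above can be sketched as a small in-memory service. This is only a sketch under stated assumptions: the `Tokenizer` class and its method names are hypothetical, the dictionaries stand in for the secure token/value store, and random hex strings from `secrets.token_hex` are just one possible token format.

```python
import secrets

class Tokenizer:
    """In-memory stand-in for the secure token/value store (illustration only)."""

    def __init__(self):
        # One pair of lookup tables per PII field name (the "token table").
        self._value_to_token = {}
        self._token_to_value = {}

    def tokenize(self, field, value):
        """Return the existing token for (field, value), or save the value and mint one."""
        forward = self._value_to_token.setdefault(field, {})
        reverse = self._token_to_value.setdefault(field, {})
        token = forward.get(value)
        if token is None:
            token = secrets.token_hex(16)  # random, so it reveals nothing about the value
            forward[value] = token
            reverse[token] = value
        return token

    def detokenize(self, field, token):
        """Look up the original value; raises KeyError for an unknown field or token."""
        return self._token_to_value[field][token]

# The requesting process swaps the PII value for its token before persisting.
tokenizer = Tokenizer()
record = {"name": "Ann", "ssn": "123-45-6789"}
record["ssn"] = tokenizer.tokenize("ssn", record["ssn"])
```

Because the token tables are kept per field, the same string submitted under two different field names receives two unrelated tokens.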

Every value in a PII column or attribute field is tokenized.

Data Governance

PII and other sensitive information must be flagged in the metadata catalog so that it is clear which fields need to be tokenized. Hopefully, organizations are already doing this so they know how to implement PII and non-PII access controls for sensitive information.
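
A minimal sketch of how flagged catalog entries could drive tokenization is shown below. The `CATALOG` layout, the `pii` and `pii_type` keys, and the `pii_columns` helper are assumptions made for illustration; real metadata catalogs expose this information through their own APIs.

```python
# Hypothetical catalog entries: one list of column descriptors per dataset.
CATALOG = {
    "customers": [
        {"column": "customer_id", "pii": False},
        {"column": "email", "pii": True, "pii_type": "EMAIL"},
        {"column": "ssn", "pii": True, "pii_type": "SSN"},
        {"column": "signup_date", "pii": False},
    ],
}

def pii_columns(dataset):
    """Map each flagged column in a dataset to its PII type code."""
    return {
        entry["column"]: entry["pii_type"]
        for entry in CATALOG[dataset]
        if entry["pii"]
    }
```

For the sample catalog, `pii_columns("customers")` returns only the `email` and `ssn` columns, which are the ones a pipeline would send to the tokenizer.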

Tokenization for Lake Ingestion

Tokenization is implemented as an API that can be called as part of batch processes, streaming processes, or applications. This process flow shows an environment with a standard Data Lake ingestion mechanism that tokenizes flagged PII fields. An organization could instead take a different approach, where data is tokenized at the source before it is submitted to big data stores.

Tokenization happens prior to lake ingestion.
Tokenization happens prior to storage in operational stores.
Cross-application APIs may use the tokenized values by contract.
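
The ingestion step might look roughly like the generator below, which works for both batch and streaming input because it handles one record at a time. The `tokenize_stream` function, the `stub_tokenize` placeholder, and the choice to pass missing (`None`) values through untouched are all assumptions; in practice the `tokenize` argument would be a call to the tokenization API.

```python
def tokenize_stream(records, pii_fields, tokenize):
    """Yield copies of the records with every flagged PII column replaced by a token.

    pii_fields maps column name -> PII type code; tokenize(type_code, value) -> token.
    """
    for record in records:
        clean = dict(record)  # never mutate the caller's record
        for column, type_code in pii_fields.items():
            if clean.get(column) is not None:  # assumption: missing values stay empty
                clean[column] = tokenize(type_code, clean[column])
        yield clean

# Placeholder for the real tokenization API: hands out tok-1, tok-2, ... per new value.
_seen = {}

def stub_tokenize(type_code, value):
    return _seen.setdefault((type_code, value), f"tok-{len(_seen) + 1}")

incoming = [
    {"id": 1, "email": "ann@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "ann@example.com"},
]

# Only tokenized rows ever reach the lake writer.
lake_rows = list(tokenize_stream(incoming, {"email": "EMAIL"}, stub_tokenize))
```

Rows 1 and 3 end up with the same token, so they remain joinable in the lake, while the raw address never leaves the ingestion step.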

Revision History

Created 2020 07 14

Updated 2023 03 14 - Happy Pi Day

Joe Freeman's Blog