Tokenizing Sensitive Information - PII Protection
The only way to protect sensitive information is to remove the sensitive
values everywhere they are not absolutely needed. Data designers can remove the
fields completely or change the field values so that they are useless
in the event of data theft. Data tokenization and data encryption are two possible solutions to this
problem. Both approaches must be implemented so that they return the same non-PII value for a given PII value every time they are invoked.
We're going to talk about tokenization here. Tokenized field values must be changed in a repeatable way so that the attributes are still useful for joining data in queries or reports. This means
every data set with the same value in the same PII field will have the same
replacement value, which lets us retain the ability to join across
datasets or tables using sensitive data fields.
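To see why the repeatable mapping matters, here is a minimal sketch, assuming a hypothetical in-memory token table and an invented `TKN-` token format; it shows that two data sets tokenized independently still join on the tokenized column:

```python
# Hypothetical deterministic tokenizer: the same PII value always maps
# to the same token, so joins on tokenized columns still work.
token_table: dict = {}  # stands in for the secure token/value store

def tokenize(field: str, value: str) -> str:
    """Return the existing token for (field, value), or mint a new one."""
    key = (field, value)
    if key not in token_table:
        token_table[key] = f"TKN-{len(token_table):06d}"  # opaque token
    return token_table[key]

customers = {"cust-1": "123-45-6789", "cust-2": "987-65-4321"}
claims    = {"claim-9": "123-45-6789"}

tok_customers = {k: tokenize("ssn", v) for k, v in customers.items()}
tok_claims    = {k: tokenize("ssn", v) for k, v in claims.items()}

# "123-45-6789" received the same token in both data sets, so a join
# on the tokenized SSN column succeeds without exposing the PII.
assert tok_customers["cust-1"] == tok_claims["claim-9"]
```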
Every PII field has a type code, or key. That type is used whenever a value for that field is tokenized or detokenized, which means you have to create a data dictionary containing all the PII fields that can be tokenized. Applications must use that key whenever they protect a PII field of that type.
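As a purely illustrative example (the field names and type codes here are invented), such a data dictionary can be as simple as a lookup from field name to token table key:

```python
# Hypothetical data dictionary: every tokenizable PII field and the
# type code (token table name) applications must pass to the tokenizer.
PII_DICTIONARY = {
    "ssn":           "ssn",
    "email_address": "email",
    "date_of_birth": "dob",
    "phone_number":  "phone",
}

def type_code_for(field_name: str):
    """Return the token table key for a field, or None if it is not PII."""
    return PII_DICTIONARY.get(field_name)
```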
Let's talk about tokenizing sensitive information in data
stores.
Tokenization Algorithm
All PII attributes must be identified so that token tables can be added
to the token/value store for those attributes. All values in fields
flagged as PII are tokenized.
We tokenize sensitive information by replacing it with a value
that acts as a pointer/key to the original value in a secure data store. Each PII field has
its own token table: each PII attribute is given a name, and that name is used as the token table name wherever the attribute exists. The flow works like this (a sketch follows the list):
- Create a categorization for each identified PII field.
- Pass PII values and their field names to the tokenization service.
- The tokenizer saves the PII value in its data store.
- The tokenizer generates a new token, or returns the existing token if that value was already saved for that field.
- The requesting process replaces the PII value with the token and, if required, persists it to the local data store.
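A minimal sketch of this flow, assuming an in-memory store and invented names (`TokenizationService`, the `TKN-` token format); a real service would sit behind an API with a hardened token/value store:

```python
import secrets

class TokenizationService:
    """Illustrative in-memory tokenizer; one token table per PII attribute."""

    def __init__(self) -> None:
        self._tables: dict = {}   # attribute name -> {pii_value: token}
        self._reverse: dict = {}  # attribute name -> {token: pii_value}

    def tokenize(self, attribute: str, value: str) -> str:
        table = self._tables.setdefault(attribute, {})
        if value in table:
            return table[value]            # repeatable: same value, same token
        token = f"TKN-{secrets.token_hex(8)}"
        table[value] = token               # save the PII value in the store
        self._reverse.setdefault(attribute, {})[token] = value
        return token

    def detokenize(self, attribute: str, token: str) -> str:
        return self._reverse[attribute][token]

svc = TokenizationService()
t1 = svc.tokenize("ssn", "123-45-6789")
t2 = svc.tokenize("ssn", "123-45-6789")
assert t1 == t2                            # existing token returned on repeat
assert svc.detokenize("ssn", t1) == "123-45-6789"
```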
Data Governance
PII and other sensitive information must be flagged in the metadata
catalog in order to determine which fields need to be tokenized.
Ideally, organizations are already doing this so they know how to implement
PII and non-PII access controls for sensitive information.
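As a hypothetical illustration of such a flag (the field names and metadata layout are invented; real catalogs such as Apache Atlas or AWS Glue have their own schemas), the ingestion pipeline only needs a way to ask which fields are PII:

```python
# Hypothetical metadata catalog entries for one table; the ingestion
# pipeline reads the "pii" flag to decide which fields to tokenize.
CATALOG = {
    "customers": {
        "customer_id": {"type": "string", "pii": False},
        "ssn":         {"type": "string", "pii": True, "token_table": "ssn"},
        "email":       {"type": "string", "pii": True, "token_table": "email"},
        "signup_date": {"type": "date",   "pii": False},
    }
}

def pii_fields(table: str) -> list:
    """Fields of `table` that the governance catalog flags as PII."""
    return [f for f, meta in CATALOG[table].items() if meta["pii"]]
```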
Tokenization for Lake Ingestion
Tokenization is implemented as an API that can be called as part of batch
processes, streaming processes, or applications. This process flow
shows an environment where there is a standard Data Lake ingestion mechanism that
tokenizes flagged PII fields. An organization could take a different
approach, tokenizing data at the source before it is submitted to big
data stores.
Tokenization happens prior to lake ingestion and prior to storage in operational stores. Cross-application APIs may exchange the tokenized values by contract.
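A hedged sketch of the ingestion hook, with all names invented: `PII_FIELDS` echoes the governance catalog, and `tokenize` stands in for the call to the tokenization API (a real tokenizer stores value-to-token mappings in a vault rather than hashing):

```python
import hashlib

# Invented stand-ins: the catalog's PII flags and the tokenization API call.
PII_FIELDS = {"customers": ["ssn", "email"]}

def tokenize(attribute: str, value: str) -> str:
    """Placeholder for the tokenization service; deterministic for the demo.
    A production tokenizer stores the mapping securely, it does not hash."""
    digest = hashlib.sha256(f"{attribute}:{value}".encode()).hexdigest()
    return f"TKN-{digest[:16]}"

def tokenize_record(table: str, record: dict) -> dict:
    """Replace catalog-flagged PII fields with tokens before lake ingestion."""
    out = dict(record)
    for field in PII_FIELDS.get(table, []):
        if out.get(field) is not None:
            out[field] = tokenize(field, out[field])
    return out

raw  = {"customer_id": "cust-1", "ssn": "123-45-6789", "signup_date": "2020-07-14"}
safe = tokenize_record("customers", raw)
# safe["ssn"] is now an opaque token; the real SSN never lands in the lake.
```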
Revision History
Created 2020 07 14
Updated 2023 03 14 - Happy Pi Day