Lake Mutability in the Age of Privacy and Regulation
Phase 1: Data Set Storage Retention
Retention times are set on the file(s) that make up a dataset. Datasets are managed as files, and entire datasets are removed at the end of the retention period.
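A minimal sketch of dataset-level retention, assuming a hypothetical in-memory catalog that maps each dataset name to its creation date and retention period in days. The names `catalog` and `expired_datasets` are illustrative and not taken from any particular lake product.

```python
from datetime import date, timedelta

def expired_datasets(catalog, today):
    """Return the names of datasets whose retention period has ended."""
    return sorted(
        name
        for name, (created, retention_days) in catalog.items()
        if created + timedelta(days=retention_days) <= today
    )

# Hypothetical catalog: dataset name -> (creation date, retention in days)
catalog = {
    "clickstream_2019": (date(2019, 1, 1), 365),
    "orders_2023": (date(2023, 6, 1), 730),
}

expired = expired_datasets(catalog, today=date(2024, 1, 1))
```

Each name in `expired` stands for a whole dataset, so the cleanup step would delete all of its files together.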
Phase 2a: Partition Storage Retention
Retention times are stored somewhere and bound to partition keys. Data is organized as tables in a table/partition/file format. Partitions are based on dates that can be used to approximate a retention date, and partitions are removed based on those dates. This is analogous to dropping database partitions in an RDBMS.
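The partition approach can be sketched as follows, assuming partition keys of the hypothetical form `dt=YYYY-MM-DD` and a single retention period per table. The function name `partitions_to_drop` is illustrative only.

```python
from datetime import date, timedelta

def partitions_to_drop(partition_keys, retention_days, today):
    """Return partition keys whose date is older than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    drop = []
    for key in partition_keys:
        # Assumed key layout: "dt=2024-01-15"
        _, _, value = key.partition("=")
        if date.fromisoformat(value) < cutoff:
            drop.append(key)
    return drop

keys = ["dt=2023-12-30", "dt=2024-01-10", "dt=2024-02-01"]
to_drop = partitions_to_drop(keys, retention_days=30, today=date(2024, 2, 5))
```

Only the partition key is inspected, so no data files need to be read to decide what to remove.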
Phase 2b: Privacy Masking
This is not related to retention but to the need for mutable data in a lake. Data values must be obfuscated or masked to remove potentially sensitive information. Data lakes must support masking of data that is already in the lake.
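One way to mask values in existing records is a salted hash, shown here as a sketch. The field names, the salt, and the choice of SHA-256 are assumptions for illustration. A salted hash is a form of pseudonymization and may not satisfy every masking requirement.

```python
import hashlib

def mask_record(record, sensitive_fields, salt):
    """Return a copy of the record with sensitive fields replaced by a salted hash."""
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field) is not None:
            raw = (salt + str(masked[field])).encode("utf-8")
            # Keep a short prefix of the digest as the masked value
            masked[field] = hashlib.sha256(raw).hexdigest()[:16]
    return masked

# Hypothetical record already stored in the lake
record = {"user_id": 42, "email": "pat@example.com", "country": "US"}
masked = mask_record(record, sensitive_fields=["email"], salt="demo-salt")
```

Because the records are already in the lake, the files that hold them have to be rewritten with the masked values, which is why masking requires mutability.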
Phase 3a: Record Retention by Request
Data retention rules are applied on a per-record basis, driven by customer requests made under privacy regulations. Entire datasets may have to be scanned to identify the records that must be expunged for a request.
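The full-scan cost can be seen in a small sketch. It assumes files are represented as an in-memory mapping from a hypothetical path to a list of records, and that each record has a `user_id` field.

```python
def expunge_records(files, user_ids_to_forget):
    """Scan every file and drop records that belong to the requested users.

    Returns the rewritten files and the list of paths that changed.
    """
    rewritten = {}
    touched = []
    for path, records in files.items():
        kept = [r for r in records if r["user_id"] not in user_ids_to_forget]
        if len(kept) != len(records):
            touched.append(path)
        rewritten[path] = kept
    return rewritten, touched

# Hypothetical dataset: path -> records
files = {
    "dt=2024-01-01/part-0": [{"user_id": "a", "v": 1}, {"user_id": "b", "v": 2}],
    "dt=2024-01-02/part-0": [{"user_id": "c", "v": 3}],
}
rewritten, touched = expunge_records(files, user_ids_to_forget={"b"})
```

Every file is read even though only one is rewritten, because the user identifier is not part of the partition key.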
Phase 3b: Record Retention by Inactivity
Government and corporate regulations may require that user data be deleted within some time period after the customer terminates their relationship with the holder of the data. Data retention rules are applied and records are purged based on user-last-active dates or relationship termination dates. All files in a tabular dataset may have to be searched because the activity date calculation may require access to fields that are not in the partition key.
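A two-pass sketch of inactivity-based purging, assuming each record carries hypothetical `user_id` and `event_date` fields and that a user's last-active date is the latest `event_date` seen anywhere in the dataset.

```python
from datetime import date, timedelta

def purge_inactive(files, max_inactive_days, today):
    """Remove all records for users with no activity since the cutoff date."""
    cutoff = today - timedelta(days=max_inactive_days)

    # Pass 1: find each user's latest activity across every file
    last_active = {}
    for records in files.values():
        for r in records:
            uid = r["user_id"]
            if uid not in last_active or r["event_date"] > last_active[uid]:
                last_active[uid] = r["event_date"]
    inactive = {uid for uid, seen in last_active.items() if seen < cutoff}

    # Pass 2: rewrite files without the inactive users' records
    kept = {
        path: [r for r in records if r["user_id"] not in inactive]
        for path, records in files.items()
    }
    return kept, inactive

# Hypothetical dataset: path -> records
files = {
    "dt=2020-01": [{"user_id": "a", "event_date": date(2020, 1, 1)}],
    "dt=2020-02": [{"user_id": "b", "event_date": date(2020, 2, 1)}],
    "dt=2024-01": [{"user_id": "a", "event_date": date(2024, 1, 1)}],
}
kept, inactive = purge_inactive(files, max_inactive_days=365, today=date(2024, 3, 1))
```

User `a` has an old record in `dt=2020-01` that must be kept because of later activity, so the partition date alone cannot decide what to purge.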
Future Regulation
- Forget my data
- Remove terminated relationship data
- Find and remove Personally Identifying Information where it exists