Lake Mutability in the Age of Privacy and Regulation

Data immutability was one of the core tenets of Data Lakes when they first became popular. Mutable data went to relational and document databases, while immutable data and documents were stored in the lake.

Emerging privacy and data sharing regulations are adding data retention, data visibility and data management rules. These rules may drive companies to rethink which data should be stored in data lakes and how it should be stored.

Retention requirements over time


Phase 1: Data Set Storage Retention

Retention times are set on the file(s) that make up a dataset. Datasets are managed as files. Entire datasets are removed at the end of the retention period.
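A minimal sketch of dataset-level retention, assuming a hypothetical in-memory catalog that maps each dataset name to its creation date and retention period in days. The catalog layout and the function name are illustrative, not taken from any particular lake product.

```python
from datetime import date, timedelta


def expired_datasets(catalog, today):
    """Return the names of datasets whose retention period has ended.

    catalog maps dataset name -> (created_on, retention_days); this shape
    is an assumption made for the example.
    """
    expired = []
    for name, (created_on, retention_days) in catalog.items():
        if created_on + timedelta(days=retention_days) <= today:
            expired.append(name)
    return sorted(expired)


catalog = {
    "clickstream_2019": (date(2019, 1, 1), 365),
    "orders_2020": (date(2020, 1, 1), 730),
}
to_delete = expired_datasets(catalog, date(2020, 6, 1))
```

Every file belonging to a dataset in `to_delete` would then be removed as a unit. No individual record is ever inspected.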

Phase 2a: Partition Storage Retention

Retention times are stored in metadata and bound to partition keys. Data is organized as tables in a table/partition/file format. Partitions are based on dates that can be used to approximate a retention date. Partitions are removed based on those dates. This is analogous to dropping database partitions in an RDBMS.
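Partition-level retention could be sketched as follows, assuming partition keys of the form `dt=YYYY-MM-DD`. That naming convention and the function name are assumptions for illustration only.

```python
from datetime import date, timedelta


def partitions_to_drop(partition_keys, retention_days, today):
    """Return partition keys whose date is older than the retention window.

    Keys are assumed to look like "dt=2020-01-15".
    """
    cutoff = today - timedelta(days=retention_days)
    stale = []
    for key in partition_keys:
        partition_date = date.fromisoformat(key.split("=", 1)[1])
        if partition_date < cutoff:
            stale.append(key)
    return stale


keys = ["dt=2020-01-01", "dt=2020-05-01", "dt=2020-06-01"]
stale = partitions_to_drop(keys, retention_days=90, today=date(2020, 6, 15))
```

Only the partition key is examined, so the decision needs no access to the data files themselves. This is what makes it cheap compared with the record-level phases.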

Phase 2b: Privacy Masking

This is not related to retention but is related to the need for mutable data in a lake.  Data values must be obfuscated or masked to remove potentially sensitive information. Data Lakes must support this on data already in the lake.
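As a rough sketch, masking data that is already in the lake means reading existing records and writing replacement copies. The example below uses a salted SHA-256 digest as the masking function. That choice, the field names, and the function names are assumptions, and whether hashing is an acceptable masking technique depends on the regulation in question.

```python
import hashlib


def mask_value(value, salt):
    """Replace a value with a salted SHA-256 hex digest."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def mask_records(records, sensitive_fields, salt):
    """Return new records with the sensitive fields masked.

    The input records are left unmodified, mirroring a lake where files are
    rewritten rather than updated in place.
    """
    masked = []
    for record in records:
        copy = dict(record)
        for field in sensitive_fields:
            if copy.get(field) is not None:
                copy[field] = mask_value(copy[field], salt)
        masked.append(copy)
    return masked


records = [{"id": 1, "email": "a@example.com", "country": "US"}]
masked = mask_records(records, ["email"], salt="example-salt")
```

The masked copies would replace the original files, which is the mutation that an immutable lake design did not anticipate.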

Phase 3a: Record Retention by Request

Data retention rules are applied on a per-record basis based on customer requests as part of privacy regulations. Entire data sets may have to be scanned to identify records that must be expunged based on the request.
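The scan-and-rewrite pattern could look like the sketch below. Files are modeled as a dictionary of file name to record list, and `customer_id` is a hypothetical field name. A real lake would read and rewrite files in its own storage format.

```python
def expunge_records(files, customer_ids):
    """Scan every file and drop records belonging to the given customers.

    Returns (new_files, rewritten), where rewritten lists the files that
    actually changed and would need to be written back to the lake.
    """
    new_files = {}
    rewritten = []
    for name, records in files.items():
        kept = [r for r in records if r["customer_id"] not in customer_ids]
        if len(kept) != len(records):
            rewritten.append(name)
        new_files[name] = kept
    return new_files, rewritten


files = {
    "part-0": [{"customer_id": "c1"}, {"customer_id": "c2"}],
    "part-1": [{"customer_id": "c3"}],
}
new_files, rewritten = expunge_records(files, {"c1"})
```

Every file is read even though only `part-0` is rewritten, which is where the cost of per-record requests comes from.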

Phase 3b: Record Retention by Inactivity

Government and corporate regulations may require that user data be deleted within some time period after the customer terminates its relationship with the holder of the data. Data retention rules are applied and records purged based on user-last-active dates or relationship termination dates. All files in a tabular dataset may have to be searched because the activity date calculation may require access to fields that are not in the partition key.
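The sketch below illustrates why the partition key alone is not enough. It assumes partitions keyed by ingest date while each record carries its own `last_active` date. The field names and data layout are hypothetical.

```python
from datetime import date, timedelta


def inactive_users(partitions, today, max_inactive_days):
    """Find users whose most recent activity is older than the cutoff.

    Every partition must be read because last_active is a record field,
    not part of the partition key.
    """
    latest = {}
    for rows in partitions.values():
        for row in rows:
            uid = row["user_id"]
            if uid not in latest or row["last_active"] > latest[uid]:
                latest[uid] = row["last_active"]
    cutoff = today - timedelta(days=max_inactive_days)
    return {uid for uid, seen in latest.items() if seen < cutoff}


def purge_users(partitions, user_ids):
    """Return partitions with all records for the given users removed."""
    return {
        key: [r for r in rows if r["user_id"] not in user_ids]
        for key, rows in partitions.items()
    }


partitions = {
    "ingest_dt=2019-01-01": [
        {"user_id": "u1", "last_active": date(2019, 1, 1)},
        {"user_id": "u2", "last_active": date(2019, 1, 1)},
    ],
    "ingest_dt=2020-05-01": [
        {"user_id": "u2", "last_active": date(2020, 5, 1)},
    ],
}
stale_users = inactive_users(partitions, date(2020, 6, 1), max_inactive_days=365)
purged = purge_users(partitions, stale_users)
```

User `u2` appears in the old partition but was active recently, so dropping that partition by date would remove data that must be kept. The decision has to be made per user across all partitions.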

Future Regulation

Consumer data will become more regulated over time, increasing the number of rules around data retention and access. Companies may also rethink which data goes into data lakes, taking into account the cost of remediation and retention.

New regulations include: 
  • Forget my data
  • Remove terminated relationship data
  • Find and remove Personally Identifying Information where it exists

Legacy Retention

Document retention initially relied on storage time-to-live (TTL). This meant retention was driven by document age and not by record age or data-driven business rules. Many data lakes never built any document retention or legal hold support, knowing that they wouldn't have to run retention until years after the lake was created.

Created 6/2020
