Call Recordings and other Binary Data and Metadata in the Data Lake

Data lakes hold data in any format.  This includes structured data, semi-structured text data, documents, and binary data. Organizing that binary data and its metadata can be done in several ways.

Images in Video

Welcome

This post is about binary data and its associated descriptive metadata.
The slide shows some of the metadata that could be associated with each call recording.
The recording itself is highly sensitive because we don't know exactly what was said on the call.
The extracted text is just as sensitive because it is a full-text copy of the recording and carries the same risk.

Media and other binary files add up quickly.
We could have millions of call recordings, each with its own metadata.
This is a large-data problem.
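
A rough back-of-the-envelope sketch shows how quickly this grows. Every number here is an assumption picked for illustration (call count, average length, audio bitrate, metadata size per call), not a figure from the slides, and `estimate_storage_bytes` is a hypothetical helper.

```python
def estimate_storage_bytes(num_calls: int, avg_minutes: int,
                           audio_kbps: int, metadata_bytes_per_call: int) -> dict:
    """Estimate raw storage for call audio and its metadata, in bytes."""
    # kilobits per second -> bytes per second
    bytes_per_second = audio_kbps * 1000 // 8
    audio_bytes_per_call = avg_minutes * 60 * bytes_per_second
    return {
        "audio_bytes": num_calls * audio_bytes_per_call,
        "metadata_bytes": num_calls * metadata_bytes_per_call,
    }


# Assumed workload: 5 million calls, 6 minutes each, 64 kbit/s audio,
# and about 2 KB of descriptive metadata per call.
estimate = estimate_storage_bytes(5_000_000, 6, 64, 2_000)
print(f"audio:    {estimate['audio_bytes'] / 1e12:.1f} TB")    # audio:    14.4 TB
print(f"metadata: {estimate['metadata_bytes'] / 1e9:.1f} GB")  # metadata: 10.0 GB
```

Under these assumptions the binaries outweigh the metadata by three orders of magnitude, which is why the two are worth thinking about separately.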

We have to pick a format for the binary (non-rectangular) data and for its associated metadata.
We can keep the binaries in their native formats and link to them, or embed the binary data inside another format.

Here are two of the major options.
Binary data can be stored in its native format with the metadata stored separately.
Binary data can be embedded in a combined data/metadata record in some other cloud structure.

The slide lists some of the advantages and disadvantages of keeping binaries in their native formats and binding them to the metadata through pointers or some other mechanism.
Note that this approach also works when you want a single copy of the binaries referenced from several storage environments, such as lakes and warehouses.
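
As a minimal sketch of the pointer approach, the metadata record below carries a URI to the native-format file instead of the bytes themselves. The field names, the `s3://example-lake/...` URI, and the `describe_recording` helper are all hypothetical; the checksum is one possible way to detect a pointer that no longer matches its binary.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class CallRecordingMetadata:
    call_id: str
    recording_uri: str   # pointer to the binary, which stays in its native format
    content_type: str
    size_bytes: int
    sha256: str          # lets a reader verify the pointer still matches the binary
    sensitivity: str


def describe_recording(call_id: str, recording_uri: str, audio_bytes: bytes,
                       content_type: str = "audio/wav") -> CallRecordingMetadata:
    """Build the metadata record; the audio itself is stored elsewhere."""
    return CallRecordingMetadata(
        call_id=call_id,
        recording_uri=recording_uri,
        content_type=content_type,
        size_bytes=len(audio_bytes),
        sha256=hashlib.sha256(audio_bytes).hexdigest(),
        sensitivity="high",
    )


audio = b"placeholder bytes standing in for a real recording"
record = describe_recording(
    "call-0001", "s3://example-lake/recordings/call-0001.wav", audio
)
# One JSON line per recording; the same line could be loaded into a lake table
# or a warehouse table, both pointing at the single stored binary.
metadata_line = json.dumps(asdict(record))
print(metadata_line)
```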

The slide lists some of the advantages and disadvantages of embedding the binary inside a cloud-native format like ORC or Parquet. In this case, the binary might be embedded in a column of the same record that holds the metadata.
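
The shape of an embedded record can be sketched with the standard library alone. This is a stand-in, not ORC or Parquet: it uses a JSON line with the audio base64-encoded in one field, where a columnar format would use a binary-typed column. The field name `audio_b64` and both helper functions are hypothetical.

```python
import base64
import json
from typing import Tuple


def embed_recording(metadata: dict, audio_bytes: bytes) -> str:
    """Return one combined record holding the metadata and the audio together."""
    record = dict(metadata)  # copy so the caller's dict is not modified
    record["audio_b64"] = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps(record)


def read_recording(line: str) -> Tuple[dict, bytes]:
    """Split a combined record back into its metadata and the raw audio bytes."""
    record = json.loads(line)
    audio_bytes = base64.b64decode(record.pop("audio_b64"))
    return record, audio_bytes


audio = b"placeholder bytes standing in for a real recording"
line = embed_recording({"call_id": "call-0001", "sensitivity": "high"}, audio)

metadata, restored = read_recording(line)
print(metadata)           # {'call_id': 'call-0001', 'sensitivity': 'high'}
print(restored == audio)  # True
```

The metadata and the binary travel as one unit here, so there is no pointer to go stale; the cost is that the record is as large as the recording it carries.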

Think about your requirements when picking your storage model.


Created 2021 Jun 10

