Moving Graph data to a Rectangular Analytical Store

Graph style databases can be used to model some data sets in a more intuitive fashion. This lets us leverage the structure for special purpose applications and data exploration. 

Standard analytics tools often work best against relational record-oriented data.  There are a couple strategies for moving graph modeled data to relational stores for analytics.

We will walk through some very basic options. There are more sophisticated approaches than what we will discuss

Video - YouTube

The Initial Graph DB

Here we have a simple graph modeling a person with their various addresses and phone numbers.  We have modeled two different address types, mailing, and billing.  The model supports multiple phone numbers but does not break them out by type and purpose.
Each of the graph node types can have additional business and technical properties.  The Person might have FirstName, LastName, last modified at.

The relationships or edges can also have additional properties.  They might all have something like added_at or originated_from or an active/inactive flag.

Fit for reporting tools

Reporting data tends to be flattened and denormalized to optimize query performance.  our goal is to take the graph/mesh and convert it into denormalized tables.  Each table may be populated from multiple relationships/edges and multiple Vertexes.

Extracting per Root-Object

It seems like the most graph-centric way to do this is to extract each root object and all related data.  A query would retrieve the person and then navigate across all the primary and possibly secondary relationship arcs. This will result in a document with embedded to-one and to-many related sub-documents.  

This example creates a single record in a single table for each person.
The extracted data contains some portion of the hierarchy of the original graph database schema.  The picture only shows graph nodes no more than one away from the primary object.  Our query would extend N deep across relationships.

Extracting Sub-Graphs

There are variations of the "root object query" extract.  One example is that we still have a root object but we run a query to retrieve all the Vertex across individual relationships from the root object. In this example, we end up with 3 persons each linked to either an address or a phone.  We would have three relatively flat data records for our person.

This example creates 3 people for each person in the graph database. They could be unified via query or a view on top of the data.

The extracted data is flatter than in the previous but still may have embedded hierarchy if the Vertex or Edge elements support document-oriented properties.

Extracting Vertex/Edge entities

The simplest way of modeling and extracting the graph to a relational store is to treat each Vertex and Edge type as their own relational table.  We ignore the meaning of the graph data and do mechanical extractions. The extract runs multiple simple queries, instead of running complex root-node based queries.

We can run data based query on each entity type retrieving any data modified since the last run.
The extracted data is already in a shape that is mostly ready for a relational database.  Some entities have embedded collections of properties. Those could be stuffed in their own columns or modeleled some other way.

1) Tables Shaped like Graph

This shows the tables that result from an entity based extract. Each graph Edge and Vertex type becomes its own relational tables.

2) De-Normalization for Presentation Model

The final stage is the de-normalization of the graph-extract.  Tables and views are created that are optimized for the consumer-oriented schema.  

3) Transformation Pipeline



Implementation Notes

Teams may adjust their approach based on the graph language tools available. My experience is based on using gremlin queries to extract individual Vertex and Edge collections.

Teams should consider adding mechanical dates like last_updated_at to every Vertex and Edge type.  This can make it easier to run extraction queries if CDC mechanisms are inadequate.

Comments

Popular posts from this blog

Installing the RNDIS driver on Windows 11 to use USB Raspberry Pi as network attached

Understanding your WSL2 RAM and swap - Changing the default 50%-25%

Almost PaaS Document Parsing with Tika and AWS Elastic Beanstalk