Saturday, September 29, 2012

Anti-Affinity for In-Memory Caches or No SQL Data Stores

In-Memory databases and caches, either  SQL or no-SQL, are prized for the high performance and high reliability. They scale up by increasing the number of cache nodes to either partition the data for larger data sets or to replicate the data to support more parallel operations.  Systems often grow to be hundreds of nodes with some providing almost linear growth versus CPU count.    Best practices say to keep  duplicate or redundant copies of data in multiple nodes.  Disaster Recover (DR) is often integrated through WAN data replication features in the enterprise class versions of these products where data is pushed to alternative data centers.

Cluster node management and other configuration concerns can can have a big impact on reliability and performance.   Teams should strive to make sure replicated copies of data are as far apart as possible, that they have a low affinity to each other. Teams who are told "you don't care where the servers are" should be skeptical of these clams because modern deployment methods including Virtual Machines and Cloud deployments cloud the actual topology creating additional challenges.   Anti-affinity is the idea that duplicate data should be far enough apart to protect mission critical systems systems.

Sample No-SQL Schema

The rest of this article uses a simple airline schema. There are three regions/tables. 
  • Customers: Customer data is replicated across all nodes.
  • Flights: The airline flight schedule is replicated across all nodes.
  • Reservations with Seat Assignments: This larger table is partitioned into two parts.  This means at least two caching nodes are required to hold all the data. 
The following depicts the minimum set of two nodes required to describe the data.


Different Topologies

The previous diagram shows a partitioned data store with 1/2 of the reservations/seat data in each node. This means 1/2 the data becomes unavailable if a note fails.  Most systems are configured with at least two copies of each piece of data exists in the cluster at all times. This means the sample data store requires at least 4 nodes to support a two part partitioned table.  Each node in the previous diagram would be duplicated in this case.

Inside-the-enterprise cache/memory-db products are scaled up by standing up additional instances one instance per physical or virtual machine.   Microsoft products like AppFabric are limited to this deployment model. Most Java and other non-os bound products, like Gemfire and Coherence have the option of running multiple server instances within the same operating system. This is useful when running on physical hardware where you want to limit the instance size of each node.  Backup and recovery time increases with the size of the instance.  Smaller instances backup and migrate faster than larger ones. Users may wish to run on virtual machines if they are limited to one node instance per machine. They may pay a bigger wastage penalty because of the replicated O/S installs and memory usage but some consider that less of an issue with memory becoming cheaper all the time.

Single Instance Machine Deployments

The following picture shows a physical deployment where each data region/table has at least two redundant copies.  The reservation data is partitioned into two parts so you we end up with (1+2)*2 or 6 nodes.  Each machine / OS contains a single node.


A single node failure results in zero data loss while retaining data redundancy.  This works for all java products that I know of.  In most cases the machines can be configured differently. Microsoft MSDN guidance says MS only supports AppFabric when all of the physical machines have identical CPU, memory and other components.  You should verify that your vendor supports slightly different physical machines which are often a result of purchases made at different times.  

Multiple Instance Machine Deployments

The following picture shows a physical deployment where each region/table has at least two redundant copies. The partitioned data is partitioned across server instances that are housed in the same physical (or virtual) machine. This means one copy of the O/S supporting both instances.


This works all the servers I have worked with other than Microsoft App Fabric.   I've successfully used multiple instances / services with up to 5 service nodes per machine across 5 machines.  Some folks like the higher level of isolation provided by single instance / server and that is often recommend in a Virtual environment.

Operational Issues

Virtual machine Distribution Matters

Operations folks will allocate virtual machines across physical machines almost completely based on load and capacity.  They also need to take into account data co-location when working with in-memory cache stores or databases.  Replicated data stores, regions, tables and partitions, should be distributed across independent physical servers.  The following diagram shows how this would work for our reservation system with three replicated copies of the two partitions of flight data. Each data region/table will still be available during the time between when a physical server fails and the VMs re-appear on another physical host assuming some sort of VM failover is in place.  This is even more true on Virtual Infrastructure without HA/FT capabilities


The following diagram shows an example of a bad virtual machine allocation plan.  A single hardware failure eliminates all the redundant copies of a partition of the reservation table/region.


Virtual Machine allocation becomes more complex with higher node count systems and increases in physical server size which increases the number of VMs per physical machine.

Anti-Affinity

Normally we want to co-located related data to improve join/query performance.  Replicated data stores/regions/tables are a subtly different  We want the replicated tables/regions to be as isolated from each other as makes business sense.  The resulting mapping exercise complicates system administration.

Some systems support anti-affinity through direct configuration manipulation and others try to do this internal.  Gemfire has a "prefer unique hosts" behavior that puts redundant data on different machines.  It also works when using virtual machines in an ESXi environment locating virtual machines on different physical hardware.  Gemfire also supports redundancy zones that let you carve up your servers by address, location or some other parameter.  Gemfire servers are told which redundancy on boot up and the cluster distributes data across the zones.  Auto-load balancing, using VM migration, should be avoided in these situations.

Leveraging Advanced Virtual Infrastructure

The affinity problem can be solved or, at the least, reduced through the use of HA/FT software integrated into the virtual infrastructure.  VSphere has the ability to migrate virtual machines from one physical machine to another when it detects a hardware or other problem on the original physical machine.  This failover takes some amount of time.  They also have a mode where they mirror a virtual machine's state to a non-active VM.  Then there is no need to migrate the VM because the live copy is always ready to go. Neither of these solve the problem where the Guest O/S is fine but  the data server software itself crashing. There are virtual infrastructure integration points.  Some of those might be able to monitor application health and restart the application or the virtual machine. 


No comments:

Post a Comment