Saturday, September 29, 2012

Anti-Affinity for In-Memory Caches or NoSQL Data Stores

In-memory databases and caches, whether SQL or NoSQL, are prized for their high performance and high reliability. They scale out by increasing the number of cache nodes, either to partition the data for larger data sets or to replicate the data to support more parallel operations.  Systems often grow to hundreds of nodes, with some providing almost linear growth versus CPU count.  Best practices say to keep duplicate or redundant copies of data in multiple nodes.  Disaster Recovery (DR) is often integrated through WAN data replication features in the enterprise-class versions of these products, where data is pushed to alternative data centers.

Cluster node management and other configuration concerns can have a big impact on reliability and performance.  Teams should strive to make sure replicated copies of data are as far apart as possible, that they have low affinity to each other. Teams who are told "you don't care where the servers are" should be skeptical of these claims, because modern deployment methods, including virtual machines and cloud deployments, obscure the actual topology, creating additional challenges.  Anti-affinity is the idea that duplicate data should be kept far enough apart to protect mission critical systems.

Sample NoSQL Schema

The rest of this article uses a simple airline schema. There are three regions/tables. 
  • Customers: Customer data is replicated across all nodes.
  • Flights: The airline flight schedule is replicated across all nodes.
  • Reservations with Seat Assignments: This larger table is partitioned into two parts.  This means at least two caching nodes are required to hold all the data. 
The following depicts the minimum set of two nodes required to hold the data.
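
As a companion to the diagram, here is a minimal sketch of the schema as Gemfire regions (Gemfire is used because it appears throughout this article; the region names come from the schema above and the calls assume the Gemfire 6.x-era RegionShortcut Java API):

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class AirlineSchema {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // Replicated regions: every node holds a full copy.
    Region<String, Object> customers =
        cache.<String, Object>createRegionFactory(RegionShortcut.REPLICATE).create("customers");
    Region<String, Object> flights =
        cache.<String, Object>createRegionFactory(RegionShortcut.REPLICATE).create("flights");

    // Partitioned region: reservations/seats are split across the cluster nodes.
    Region<String, Object> reservations =
        cache.<String, Object>createRegionFactory(RegionShortcut.PARTITION).create("reservations");
  }
}

The same shape can be declared in the other stores; the key point is that two regions are fully replicated while the third is partitioned.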


Different Topologies

The previous diagram shows a partitioned data store with 1/2 of the reservations/seat data in each node. This means 1/2 the data becomes unavailable if a node fails.  Most systems are configured so that at least two copies of each piece of data exist in the cluster at all times. This means the sample data store requires at least 4 nodes to support a two-part partitioned table.  Each node in the previous diagram would be duplicated in this case.
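
As a sketch of how that redundancy is requested, a partitioned region can be told to keep redundant copies of each bucket. This assumes Gemfire's PartitionAttributesFactory API; other products expose equivalent settings:

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.PartitionAttributesFactory;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.RegionShortcut;

public class RedundantReservations {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // Keep one redundant copy of every partition bucket: two copies in total.
    PartitionAttributesFactory<String, Object> paf =
        new PartitionAttributesFactory<String, Object>();
    paf.setRedundantCopies(1);

    Region<String, Object> reservations = cache
        .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
        .setPartitionAttributes(paf.create())
        .create("reservations");
  }
}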

Inside the enterprise, cache/memory-db products are scaled up by standing up additional instances, one instance per physical or virtual machine.  Microsoft products like AppFabric are limited to this deployment model. Most Java and other non-OS-bound products, like Gemfire and Coherence, have the option of running multiple server instances within the same operating system. This is useful when running on physical hardware where you want to limit the size of each node.  Backup and recovery time increases with the size of the instance; smaller instances back up and migrate faster than larger ones. Users may wish to run on virtual machines if they are limited to one node instance per machine. They may pay a bigger wastage penalty because of the replicated O/S installs and memory usage, but some consider that less of an issue with memory becoming cheaper all the time.

Single Instance Machine Deployments

The following picture shows a physical deployment where each data region/table has at least two redundant copies.  The reservation data is partitioned into two parts, so we end up with (1+2)*2, or 6, nodes.  Each machine/OS contains a single node.


A single node failure results in zero data loss while retaining data redundancy.  This works with every Java product that I know of.  In most cases the machines can be configured differently from each other. Microsoft MSDN guidance says MS only supports AppFabric when all of the physical machines have identical CPU, memory and other components.  You should verify that your vendor supports slightly different physical machines, which are often the result of purchases made at different times.

Multiple Instance Machine Deployments

The following picture shows a physical deployment where each region/table has at least two redundant copies. The partitioned data is partitioned across server instances that are housed in the same physical (or virtual) machine. This means one copy of the O/S supports both instances.


This works on all the servers I have worked with other than Microsoft AppFabric.  I've successfully used multiple instances/services with up to 5 service nodes per machine across 5 machines.  Some folks like the higher level of isolation provided by a single instance per server, and that is often recommended in a virtual environment.

Operational Issues

Virtual machine Distribution Matters

Operations folks will allocate virtual machines across physical machines almost completely based on load and capacity.  They also need to take into account data co-location when working with in-memory cache stores or databases.  Replicated data stores, regions, tables and partitions should be distributed across independent physical servers.  The following diagram shows how this would work for our reservation system with three replicated copies of the two partitions of reservation data. Each data region/table will still be available during the time between when a physical server fails and the VMs re-appear on another physical host, assuming some sort of VM failover is in place.  This matters even more on virtual infrastructure without HA/FT capabilities.


The following diagram shows an example of a bad virtual machine allocation plan.  A single hardware failure eliminates all the redundant copies of a partition of the reservation table/region.


Virtual machine allocation becomes more complex with higher node count systems and with increases in physical server size, which increase the number of VMs per physical machine.

Anti-Affinity

Normally we want to co-locate related data to improve join/query performance.  Replicated data stores/regions/tables are subtly different: we want the replicated tables/regions to be as isolated from each other as makes business sense.  The resulting mapping exercise complicates system administration.

Some systems support anti-affinity through direct configuration manipulation and others try to handle it internally.  Gemfire has a "prefer unique hosts" behavior that puts redundant data on different machines.  It also works when using virtual machines in an ESXi environment, locating virtual machines on different physical hardware.  Gemfire also supports redundancy zones that let you carve up your servers by address, location or some other parameter.  Gemfire servers are told which redundancy zone they belong to on boot, and the cluster distributes data across the zones.  Auto-load balancing, using VM migration, should be avoided in these situations.
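
These anti-affinity knobs are, at least in Gemfire, plain member properties. The property names below (enforce-unique-host, redundancy-zone) are Gemfire's documented settings, while the zone name itself is a made-up example:

import java.util.Properties;
import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;

public class ZonedServer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Never place redundant copies of a bucket on the same physical host.
    props.setProperty("enforce-unique-host", "true");
    // Spread redundant copies across named zones (racks, ESXi hosts, etc.);
    // "rack-a" is a made-up zone name.
    props.setProperty("redundancy-zone", "rack-a");

    Cache cache = new CacheFactory(props).create();
  }
}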

Leveraging Advanced Virtual Infrastructure

The affinity problem can be solved, or at least reduced, through the use of HA/FT software integrated into the virtual infrastructure.  vSphere has the ability to migrate virtual machines from one physical machine to another when it detects a hardware or other problem on the original physical machine.  This failover takes some amount of time.  There is also a mode that mirrors a virtual machine's state to a non-active VM; then there is no need to migrate the VM because the live copy is always ready to go. Neither of these solves the problem where the guest O/S is fine but the data server software itself crashes. There are virtual infrastructure integration points; some of those might be able to monitor application health and restart the application or the virtual machine.


Thursday, September 27, 2012

Scaling Out Data Using In Memory Stores/Caches

Standard App Topology

This diagram shows a basic web app with two app servers.  It is a fairly straightforward architecture where the two app servers are used for load balancing and fault tolerance in case one server goes down.  A single data server or cluster is sufficient. Most  folks would probably actually have a web server tier between the clients and the App Instance tier but I'll skip it so some of the MS folks feel at home.  

This diagram shows what happens when you start to increase the number of clients.  One of the beauties of the commodity server architecture is that you can continue to add application server instances to handle the additional load. Data servers can be clustered, but there is a finite limit to the data tier size and performance.  You end up with an inverted pyramid where your application is teetering on top of a narrow data tier.

NoSQL databases accelerate applications in two ways.  They increase the performance of the data tier by leaving the data in the application's native data model.  This means data, once put into the cache, can be used directly by the application instead of constantly joining multiple tables and marshaling the data as an RDBMS requires.  This reduces the work done to create your data structures.  You can also make use of the fact that inexpensive large-memory machines are readily available, making this a much cheaper way of scaling out the database when compared to the expensive hardware of high end databases.  Systems using in-memory shared-nothing databases are designed to scale to Internet-type volumes and spikes and can add capacity relatively easily.  The main benefits are:
  • More Storage is available by aggregating RAM / Storage across nodes.
  • Higher Read Throughput/IO because RAM is faster than any disk, including SSD
  • Higher Write Throughput/IO because RAM is faster than any disk, including SSD
  • Higher Transaction throughput by making use of more CPUs/Cores
  • Lower Latency because more service nodes are available to service independent requests
  • No data assembly overhead because it is already in the desired structure in the DB
NoSQL stores are amazingly fast at primary key based retrieval.  Systems like Gemfire also support indexing of arbitrary attributes inside the object graph. This makes the system fast for both key based retrieval and index based searches. Partitioning and replication also help to improve performance over single machines or small cluster environments.
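
To make the two retrieval paths concrete, here is a minimal client-side sketch; the locator address, region name, keys and the flightId attribute are invented for illustration, and the calls assume the Gemfire 6.x client API:

import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;
import com.gemstone.gemfire.cache.query.QueryService;
import com.gemstone.gemfire.cache.query.SelectResults;

public class ReservationLookup {
  public static void main(String[] args) throws Exception {
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("locator-host", 10334)   // made-up locator address
        .create();
    Region<String, Object> reservations = cache
        .<String, Object>createClientRegionFactory(ClientRegionShortcut.PROXY)
        .create("reservations");

    // Path 1: primary key retrieval, the sweet spot of these stores.
    Object reservation = reservations.get("CONF-12345");

    // Path 2: an OQL query against an indexed attribute.
    QueryService queryService = cache.getQueryService();
    SelectResults<?> results = (SelectResults<?>) queryService
        .newQuery("SELECT * FROM /reservations r WHERE r.flightId = $1")
        .execute(new Object[] { "UA-100" });
    System.out.println("matches: " + results.size());
  }
}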

Partitioning
  • Scale-up by sub-dividing a Region (Table) across nodes.
  • Performance by allowing parallel queries, each against a subset of the data.
  • Performance by isolating queries to a single node, leaving other nodes free for other queries.
  • Can be combined with replication for fault tolerance.
  • Caveat: Join-to data must be in the same node.

Replication
  • Reliability through redundant data copies.
  • Performance by allowing parallel queries against the same Region on different nodes.
  • Caveat: Maximum single-node size limits the maximum data set size.
  • Can be combined with partitioning for larger data sets.



Scenarios

Session State

This is the most basic of all uses for in-memory data servers.  High volume web applications can use a high performance backing store to hold session data.  They save the session data to the cache on every page request, allowing application servers to hold only in-flight session data at any given time.  This reduces the server memory footprint, making system memory management easier.  Extracting session data from the application nodes simplifies session failover and load balancing because user requests can be routed to any server with no loss of state. Most session caches keep redundant copies of all data, lessening the chance of user data loss.
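
A minimal sketch of the pattern: a session region keyed by session id holding an attribute map. The names are invented for illustration, and real deployments usually use the vendor's session management module rather than hand-rolled code:

import java.util.HashMap;
import java.util.Map;
import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.client.ClientCache;
import com.gemstone.gemfire.cache.client.ClientCacheFactory;
import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

public class SessionStore {
  public static void main(String[] args) {
    ClientCache cache = new ClientCacheFactory()
        .addPoolLocator("locator-host", 10334)   // made-up locator address
        .create();
    Region<String, Map<String, Object>> sessions = cache
        .<String, Map<String, Object>>createClientRegionFactory(ClientRegionShortcut.PROXY)
        .create("sessions");

    // At the end of each request, write the session back to the cache...
    Map<String, Object> session = new HashMap<String, Object>();
    session.put("cartItems", 3);
    sessions.put("session-abc123", session);

    // ...so any app server can pick up the next request with no loss of state.
    Map<String, Object> restored = sessions.get("session-abc123");
  }
}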

Data Accelerator

The in-memory data store sits between the application and the database.  Read operations first check the cache, which fetches the data from the underlying RDBMS on a cache miss. Data is saved to the database through the cache on a write-through or write-behind basis. This is really an in-memory cache more than a system-of-record data store.  In-memory stores like Gemfire can be integrated into persistence frameworks like Hibernate, where they can act as an L2 cache.  Some applications will use custom caching behavior. I worked on a system where we wrote custom code that wrapped IBM MQ calls as if MQ were a database.  In our case, reads from MQ were cached and writes caused a cache flush.
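
As a sketch of the read-through half, Gemfire lets you plug a CacheLoader into a region; it is called on every cache miss. The lookup below is a stand-in for a real JDBC call:

import com.gemstone.gemfire.cache.CacheLoader;
import com.gemstone.gemfire.cache.CacheLoaderException;
import com.gemstone.gemfire.cache.LoaderHelper;

// Called by the region whenever a get() misses the cache.
public class CustomerLoader implements CacheLoader<String, String> {
  public String load(LoaderHelper<String, String> helper) throws CacheLoaderException {
    String customerId = helper.getKey();
    // Stand-in for the real RDBMS lookup (e.g. a JDBC SELECT by primary key).
    return "customer-record-for-" + customerId;
  }

  public void close() { }  // release any JDBC resources here
}

The loader is attached through the region attributes (for example RegionFactory.setCacheLoader) so every miss funnels through it.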

Consistent Data Store

This is a little harder to describe.  The idea is that Gemfire and other in-memory data stores are designed to synchronize their data across many redundant or replicated nodes.  This means you can scale up the data store to hundreds of nodes rather than the 3-4 in a typical database cluster.  The system can also synchronize data operations to the database in either a write-through or write-behind fashion.  Some large data shops use the memory store as their primary transaction database.  They use the RDBMS for reporting or as a warm and fuzzy safety net that implements their traditional data management policies.  Cluster replication to remote data centers is simpler and higher performing than database replication, so it can also be used to fill DR databases in other data centers.
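
A sketch of the write-through half, assuming Gemfire's CacheWriterAdapter callback (the println calls stand in for real JDBC statements):

import com.gemstone.gemfire.cache.CacheWriterException;
import com.gemstone.gemfire.cache.EntryEvent;
import com.gemstone.gemfire.cache.util.CacheWriterAdapter;

// Called synchronously before the cache applies a write; throwing aborts the cache update.
public class ReservationWriter extends CacheWriterAdapter<String, String> {
  @Override
  public void beforeCreate(EntryEvent<String, String> event) throws CacheWriterException {
    // Stand-in for the real RDBMS INSERT keyed by event.getKey().
    System.out.println("write-through INSERT: " + event.getKey());
  }

  @Override
  public void beforeUpdate(EntryEvent<String, String> event) throws CacheWriterException {
    // Stand-in for the real RDBMS UPDATE.
    System.out.println("write-through UPDATE: " + event.getKey());
  }
}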


High Performance

High performance is attained through the use of redundant data nodes, through the partitioning of the data store and through the use of in-situ or "in place" processing.  Data replication means the data exists on multiple nodes; the database can spread queries out across the nodes. Partitioning means the data is subdivided across multiple nodes.  Queries can run in parallel across the nodes and then aggregate the results, or they can be isolated to individual nodes, accessing only a subset of the data, based on the partition keys.  Both approaches provide huge boosts in performance, often almost linear with the number of nodes available.  Most of the servers have the ability to run code inside the servers so that the processing is co-located with the data.  This is better than stored procedures or something like Oracle's embedded Java because the code can operate at the data model's native level.  Tools like Gemfire are written in Java and can execute Java code local to each node.  Other systems often have similar concepts.
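
As a sketch of in-place processing using Gemfire's function execution service (the function body and id are invented; FunctionAdapter and PartitionRegionHelper are the Gemfire 6.x names):

import com.gemstone.gemfire.cache.Region;
import com.gemstone.gemfire.cache.execute.FunctionAdapter;
import com.gemstone.gemfire.cache.execute.FunctionContext;
import com.gemstone.gemfire.cache.execute.RegionFunctionContext;
import com.gemstone.gemfire.cache.partition.PartitionRegionHelper;

// Runs on every node that hosts part of the partitioned region.
public class CountSeatsFunction extends FunctionAdapter {
  public void execute(FunctionContext context) {
    RegionFunctionContext rfc = (RegionFunctionContext) context;
    // Only the data living on this node; no network hops.
    Region<?, ?> localData = PartitionRegionHelper.getLocalDataForContext(rfc);
    context.getResultSender().lastResult(localData.size());
  }

  public String getId() {
    return "CountSeatsFunction";
  }
}

A client or peer would kick this off with FunctionService.onRegion(reservations).execute(new CountSeatsFunction()) and read the per-node results back from the returned ResultCollector.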

Multi-Master & Disaster Recovery

High speed in-memory data stores support WAN gateway and multi-cluster configurations.  This makes it possible to bring data and applications closer to users by spinning up geographically dispersed data centers in an all-data-centers-live configuration. Data can be updated in any data center and is replicated to the others. The same feature can be used to replicate data to a Disaster Recovery site without any complex configuration or additional tooling. Tools like Gemfire support partial region/table replication so different clusters can have filtered views of the data.

Common Misconceptions

  1. Caches must be in-process with the application to provide any performance boost. Using local in-process caches provides the best performance but limits the size of the data and complicates system tuning and analysis.  Using local on-box caches provides the second best performance and is used by many teams, though teams that want tier isolation shy away from it.  Standalone cache node clusters provide a lot of flexibility and simplify scale-out.  Most large data implementations run stand-alone cache servers and get very high performance.
  2. All of the benefits come from being in-memory. The data stores support high speed key based retrieval of arbitrary and potentially complex data models.  Overhead can be significantly reduced by removing the need to map between relational and application data models.
  3. Reliable data stores require complex and expensive infrastructure.  In-memory data stores/caches can run on relatively inexpensive commodity hardware.  Many of them can be distributed and run as part of the normal enterprise application deployment process.  Systems like Gemfire are distributed as Java jar files, making them "run anywhere" servers.  Systems like Microsoft AppFabric are more heavyweight, requiring OS installs and configuration that make deployment and re-deployment significantly more complicated.

Quick Notes on Gemfire

VMware Gemfire, formerly Gemstone Gemfire, has these features:
  • Java Based In-memory Data Store or cache
  • Support for arbitrary data structures, object graphs and simple RDB-like flat data
  • Wicked fast key based retrieval of arbitrarily shaped data
  • Reliability through Data Replication and WAN gateway replication
  • Performance scale-up through replication
  • Data scale-up through Data Partitioning
  • Multiple Regions or Tables in the same server, each of which can have different data policies.
  • The ability to mix and match partition and replication strategies

Components and Topology

The standard components include:
  • Locator: A process used by clients to find server nodes; directory assistance for finding nodes that have the data or nodes that are available for service.
  • Server: An actual data management process. Clusters store their data across multiple server nodes.  Servers can save data to an RDBMS in a write-through fashion or they can persist their data to a local disk.  The latter is often used where too much data exists to stream to a single data store. A minimal startup sketch appears after this list.
  • Client: A consumer of the cache data.  Clients can be straight clients or caching clients, where the Gemfire client code is configured to retain data local to the client process for performance tuning.
  • WAN Gateway: The feature or process that supports replication of data across data centers or clusters.  This also provides the hooks for database write-behind, essentially data replication to a database instead of another cluster.
  • JMX Agent: A special process that makes cluster data available to outside monitoring tools through the Java JMX interface.
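
As a rough sketch of the Server role, a data node joins the cluster through the locator and then listens for clients. The locator address and ports are made-up values; CacheFactory.set and addCacheServer are the Gemfire Java API:

import com.gemstone.gemfire.cache.Cache;
import com.gemstone.gemfire.cache.CacheFactory;
import com.gemstone.gemfire.cache.server.CacheServer;

public class StartServer {
  public static void main(String[] args) throws Exception {
    // The locators property points the new member at the cluster's locator process.
    Cache cache = new CacheFactory()
        .set("locators", "locator-host[10334]")   // made-up locator address
        .create();

    // Listen for client connections on the default cache server port.
    CacheServer server = cache.addCacheServer();
    server.setPort(40404);
    server.start();
  }
}
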
The following diagram shows a two cluster Gemfire configuration where each cluster has two servers. Each cluster has two regions, one partitioned across the servers and one replicated across the servers.  The partitioned data might be something like trip reservations while the replicated data might be the customer information that essentially joins to the reservations.  The two clusters are replicated across the WAN gateway which is also doing write-behind to the local RDBMS.


Simple Example

A two-node cluster: Node A and Node B.  There is customer data and flight information about the scheduled flights. Both customer and flight information exist on Node A and Node B.  There is a third data set, a list of reservations and seat assignments.  This is essentially the join of customer and flight data that represents a reservation. There are a lot of reservations because the flights occur on a daily basis, so we partitioned the reservation/seat information so that 1/2 is on Node A and 1/2 is on Node B.  Key based retrieval is of course always possible, but this schema makes it possible to do very fast queries or retrieval based on joins between the three tables.  Partitioning means that the two nodes can each search 1/2 the reservation data simultaneously. The picture doesn't show the locators required for cluster function.





Saturday, September 1, 2012

MSP430 Fading and other long running activities using TwoMsTimer instead of delay()

This video shows a simple TI Launchpad test bed with an RGB LED tied to three PWM pins and the red LED that comes standard. The color fading and red LED flashing are independently controlled with the code described below.

Source Code

Some folks just can't wait to open the source.  Here is the TwoMsTimer source along with the demo program used in this blog.

Overview

Micro-controller projects often include a mix of parallel tasks.  Handling those tasks with the right priority or interleaving can be a challenge when some of the tasks happen over long periods of time. Long is relative to the execution speed of the CPU in this case.

The simplest way to handle long running activities in a micro-controller is through the use of a loop that contains some type of delay to limit the loop's speed. One of the main problems with this type of code is that the main loop() or central routines are blocked from doing anything else while in the delay block. The following code, written using the Energia development system for the MSP430 processors, shows a system fading 3 LEDs on and off in sequence.

// PWM-capable output pins driving the RGB LED.
// The pin numbers here are examples; adjust for your board's PWM pins.
int pwmPins[] = {4, 9, 10};

void loop()  { 
  while (true){
    for (int i = 0; i < 3; i++){
      // fade out from max to min in increments of 5 points:
      for(int currentBrightness = 255; currentBrightness >= 0; currentBrightness -= 5) { 
        // sets the value (range from 0 to 255):
        analogWrite(pwmPins[i], currentBrightness);         
        // wait for 30 milliseconds to see the dimming effect    
        delay(30);                            
      } 

      // fade in from min to max in increments of 5 points:
      for(int currentBrightness = 0; currentBrightness <= 255; currentBrightness += 5) { 
        // sets the value (range from 0 to 255):
        analogWrite(pwmPins[i], currentBrightness);         
        // wait for 30 milliseconds to see the brightening effect    
        delay(30);                            
      } 
    } // end pin loop
  } // end while
}
Let's say we want to do something besides fade the LEDs. We could insert other statements inside the for() or while() loops, but we'd still end up with 60msec of "no activity" while we wait for the delays to finish.  Alternatives for handling this include:
  1. Create some spin-loop-activity-delay method that can be called instead of the standard delay().  
  2. Create a timed interrupt that handles the long running task, fading, leaving the main loop for other activities. 
  3. Create a timed interrupt for other activities leaving the fading in the main loop.
For the purposes of this demo, I'm going to pick option 3 and flash an LED outside of the main loop using the TwoMsTimer library. Timed interrupts provide a way of taking action on relatively precise intervals without having to write complex loop timing/adjustment code. The code flashes a 4th LED one time per second without changing the main code loop. The interrupt handler is based on the TwoMsTimer library written for the MSP430. Notice that TwoMsTimer hides all the interrupt setup and interrupt vector code.  TwoMsTimer can execute any function at up to 500 times a second. The maximum execution rate depends on the length of time it takes to execute the function and what percentage of the CPU needs to be reserved for other tasks that execute in the main loop(). This code causes TI Launchpad LED1 to flash once per second independently of the loop() code shown above.


// Pin assignments: RED_LED is Energia's built-in constant for LED1 on the Launchpad.
const int heartbeatPin = RED_LED;
boolean heartbeatPinOn = false;

void setup()  { 
  // open the hardware serial port
  Serial.begin(9600);
  pinMode(heartbeatPin, OUTPUT);
  TwoMsTimer::set(500, flash); // 500ms period
  TwoMsTimer::start();
} 

// interrupt handler passed to TwoMsTimer
void flash(){
  // log every time the handler is called, should be every 500msec
  Serial.println("flash");
  heartbeatPinOn = !heartbeatPinOn;
  digitalWrite(heartbeatPin, heartbeatPinOn);
} 

Hardware

This project uses the TI Launchpad, an RGB LED, three 100 Ohm resistors and a USB cable to power the device. It is mounted in an Apple iPod Nano shipping box.  The video shows the LED wearing its Ping-Pong ball diffuser without the case cover on.


The sample code sends the word "flash" to the Serial console every 500 msec. You can see that my demo board is a version 1.4 Launchpad with the MSP430G2553 processor, which requires the TX/RX lines to be crossed on the jumper block.  I used 4 shorting blocks and stuffed wires in the blocks before pushing them onto the header.

Another Example

This is the same TI Launchpad board mounted in an old iPod shipping box that has been painted. The RGB LED's diffuser is a ping pong ball.  In this case the main loop() runs a command processor that waits for serial commands.  The TwoMsTimer based interrupt handler drives the LEDs through a table based blinking pattern.  The interrupt handler also flashes LED2 on one second intervals as a heartbeat to help debug when pattern 0 (all off) is active. The firmware source is available on github.


Source code to a TFS build watcher that uses this device as a build status display can be found on github.

The next two pictures show a similar build in a GameStop PS2 Gift Card tin.  The tin is actually three boxes folded together, so Mr Dremel had to get involved to create a cavity large enough for the Launchpad.  I mounted the LED to the bottom so I could pull the lid on and off.  The big blob of hot glue under the LED is there because I didn't think through that a spacer would have been a better way to raise up the LED.

Recap

Common micro-controller issues with a single run-loop include the following.
  1. Code uses Delay() or Wait() calls to space out or schedule different actions.  The problem is that these delays block anything else from happening during the pause periods.
  2. Activities need to occur on relatively precise intervals irrespective of how long the task takes to process.  Delay() and Wait() calls provide a fixed delay after a task has finished, when we really want the spacing to be from task-start to task-start.  Developers have to adjust their pause or wait based on how long the intermediate code took to run.
  3. Some task or activity has a higher priority than the main-line loop() code, and it is hard to time call-outs from the main loop to these tasks.
Timed or event-driven interrupt handlers can help break the code into independent pieces that run "on demand" rather than as part of a loop. The TwoMsTimer package for the MSP430 environment makes coding for timed events simple.

The Energia wiki describes how to download the dev environment and compile the application. They also make available the USB drivers required to interface with a TI Launchpad if you haven't installed the native TI development tools.