Wednesday, September 14, 2016

AWS EBS storage balancing size versus throughput

One of my teams is rolling out a new application in AWS on EC2 instances where the operating system runs on an EBS drive. We selected one of the M3 machines that came with SSDs.  The application makes moderate use of disk I/O.  Our benchmarks were pretty disappointing. It turns out we really didn't understand what kind of I/O we had requested and where we had actually put our data.

The Root Drive

The root drive on an EC2 instance can be SSD or magnetic, depending on the instance type selected. All additional mounted/persistent disk drives on that machine will probably be of the same type.  In our case the root drive is an SSD, but it is a network-attached (EBS) drive, not local storage.

EBS disk IOPS and MB/s are provisioned exactly as described in the EC2 documentation.  The most common GP2 SSDs have a burst IOPS limit and a sustained IOPS limit. They also have a maximum MB/s transfer rate.  Both the sustained IOPS limit and the maximum transfer rate are determined by the size of the provisioned disk.  Larger disks can sustain higher IOPS and have higher throughput.

We sized our disk at 20GB, which gave us a low IOPS credit earning rate and a lower MB/s transfer rate. That was a mistake. The sweet spot for GP2 performance is around 214GB.  This is the smallest disk that gives you the highest transfer rate and the highest burst credit acquisition rate.
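
Here is a minimal back-of-the-envelope sketch of where that 214GB figure comes from, assuming the GP2 numbers published in the EBS documentation at the time (a baseline of 3 IOPS per GiB, I/O counted in chunks of up to 256 KiB, and a 160 MiB/s per-volume throughput cap):

# GP2 sizing sketch (assumed figures: 3 IOPS/GiB baseline, 256 KiB max I/O size,
# 160 MiB/s per-volume throughput cap)
baseline_iops_per_gib = 3
max_io_kib = 256
throughput_cap_mib = 160

# IOPS needed to hit the throughput cap when every I/O is the maximum size
iops_needed = throughput_cap_mib * 1024 / max_io_kib      # 640 IOPS
# smallest volume whose baseline IOPS can sustain that rate
min_size_gib = iops_needed / baseline_iops_per_gib         # ~213.3, so ~214 GiB
print(iops_needed, min_size_gib)

Anything larger earns credits faster and can sustain more IOPS, but the documented MiB/s cap stays the same.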

Teams should do their own analysis before picking the more expensive EBS volume types.  EBS GP2 burstable SSDs may provide better value than fixed provisioned-IOPS SSDs (io1).

Burst Credits

Burst credits are a way you can store IOPS in a credit bucket so that you can exceed your provisioned sustained IOPS rate.  This lets you reach up to 3000 IOPS (GP2) in short bursts without having to pay for higher performing drives.  New volumes are given roughly 30 minutes of burst credits so that machines can be provisioned and applications warmed up at the fastest speed possible.  Burst credits are earned based on the size of the EBS volume and its provisioned sustained IOPS rate.  Larger disks earn IOPS credits faster than smaller ones.
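
A rough sketch of the burst-bucket arithmetic, assuming the GP2 figures from the EBS documentation (a bucket that starts with 5.4 million I/O credits, a 3000 IOPS burst ceiling, and credits earned at the 3 IOPS per GiB baseline rate):

# Burst bucket arithmetic (assumed GP2 figures: 5.4M initial credits,
# 3000 IOPS burst ceiling, credits earned at 3 IOPS per GiB)
bucket_credits = 5400000
burst_iops = 3000
size_gib = 20
baseline_iops = 3 * size_gib                                          # 60 IOPS for our 20 GiB volume

# a full bucket drains at (burst - baseline) IOPS
burst_minutes = bucket_credits / (burst_iops - baseline_iops) / 60.0  # ~31 minutes
# an empty bucket refills at the baseline earn rate
refill_hours = bucket_credits / baseline_iops / 3600.0                # ~25 hours on a 20 GiB volume
print(burst_minutes, refill_hours)

That refill time is why an undersized volume feels fine right after provisioning and then falls off a cliff under sustained load.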

IOPS, MB/S and Block Sizes

I/O operations per second, I/O bandwidth and the data block size interact with each other to limit total throughput.  Machines that use the AWS default 16KB block size may not be able to use their full I/O bandwidth.  Our test results agree with this.
  • effective bandwidth = number of IOPS * the block size of each write
Teams may have to do some math to tune their disk drives in I/O bound applications.
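
A quick worked example using the numbers from the benchmark table below:

# effective bandwidth = IOPS * block size
block_size_kb = 16
burst_iops = 3000           # GP2 burst ceiling
baseline_iops_20gb = 60     # 3 IOPS/GB * 20GB

print(burst_iops * block_size_kb / 1000.0)          # 48 MB/s, the "16KB max" for bursting GP2
print(baseline_iops_20gb * block_size_kb / 1000.0)  # 0.96 MB/s for a 20GB volume with no credits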

Ephemeral local SSD

EC2 machines can make use of SSDs attached to the host machine that the EC2 instance is running on.  These disks provide significantly higher performance that must be balanced against their ephemeral nature. Local SSDs cannot have snapshots and their contents disappear whenever a machine is terminated or stopped and restarted.  All ephemeral SSD data must be reconstitutable from other data sources, since the VM with its local SSD could disappear at any time.

Benchmarks

The following table describes several benchmark tests against various drive configurations. Regular GP2 SSDs provide exactly the specified speed and throughput, both with burst credits available and with no burst credits.  The main areas of interest are latency, the relative performance of EBS vs ephemeral storage, and the impact of disk encryption. I don't understand why we have outliers in the data; sometimes different machines gave different ephemeral performance.  Note that Amazon does not seem to specify performance data for local SSDs.

Process Reference pages

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/benchmark_procedures.html
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

Results

This chart shows the transfer rates of various disk drives.  It makes clear how much of a performance improvement can be obtained using local SSD (ephemeral) drives over network-attached EBS.


All rows were measured with fio random write and random read tests (512m, bs=16k); latency is the 95th percentile completion latency (clat) in microseconds.

Disk Size | Disk Type | Write MB/s | Write IOPS | Write clat 95% (µs) | Read MB/s | Read IOPS | Read clat 95% (µs) | AWS max MB/s | 16KB max MB/s | IOPS spec | Test condition
20GB | SSD (GP2) | 1 | 60 | - | 1 | 60 | - | 2.56 | 0.96 | 60/3000 | No Burst
80GB | SSD (GP2) | 47 | 2991 | 8384 | 49 | 3070 | 12352 | 128 | 48 | 240/3000 | Burst
80GB | SSD (GP2) | 3.8 | 240 | - | 3.8 | 240 | - | 10.24 | 3.84 | 240/3000 | No Burst
240GB | SSD (GP2) | 47.8 | 2990 | 15168 | 48 | 2999 | 15680 | 160 | 48 | 720/3000 | Burst
240GB | SSD (GP2) | 42.4 | 2652 | 16768 | 57.3 | 3585 | 15680 | 160 | 48 | 720/3000 | Burst
32GB | Ephemeral | 316 | 19792 | - | 360 | 22527 | - | n/a | n/a | n/a | n/a
32GB | Ephemeral | 128 | 8059 | 4320 | 404 | 25272 | 916 | n/a | n/a | n/a | n/a
20GB | SSD (io1) | 8 | 503 | 13874 | - | - | - | 128 | 8 | 500 | Fixed

Burst = measured and calculated using burst credits
No Burst = measured and calculated with no available burst credits

Disk encryption does not affect disk throughput or IOPS.  It does increase disk latency. Local SSD performance can be affected by other VMs on the same hardware device sharing the same drives.

System configuration and Test Commands

# log into the test instance
ssh -i speedtest.pem ec2-user@ec2-54-196-6-33.compute-1.amazonaws.com
# update packages and install the fio benchmark tool
sudo yum update
sudo yum install fio
# format the attached EBS test volume (mount it before pointing fio at it)
sudo mkfs -t ext4 /dev/xvdb
# 16KB random write test: direct I/O, 16 jobs, 3 minutes (here run against the ephemeral drive)
sudo fio --directory=/media/ephemeral0 --name=randwrite --direct=1 --rw=randwrite --bs=16k --size=1G --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
# same settings for the random read test
sudo fio --directory=/media/ephemeral0 --name=randread --direct=1 --rw=randread --bs=16k --size=1G --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap

This blog not yet finished

Wednesday, September 7, 2016

An Environment for Every Season

Video Presentation

This is really talking about different types of software environments and how they are used during the various phases of software development, integration and support.  

Environments are the touch points between your system and other teams operating on their own cycles and timelines. They are also control points where controls are incrementally applied or where access is restricted based on the sensitivity of the data in that environment and its distance from production or development.

System Complexity

The number of system environments (sandboxes) required depends on your system complexity, your tolerance for defects, your level of Ops and DevOps automation, and the needs of the other systems that yours integrates with. This diagram shows a couple of application components that integrate with 4 different partner teams and 3 data stores.  Each partner team has its own release cycle and testing needs.  We aim for loose testing coupling, but that is often impossible, either because of the heavy investment required in mock services or because of the complexity of the interactions or the data.



Future Code Pipeline

Different companies have different names for their various stages in their pipeline.  Partner teams testing against a future release will integrate with environments on this pipeline.

The environments in white represent highly accessible  environments with low stability and low risk.  The yellow environments represent more stable environments with significantly less manual access than the earlier environments.  The red environments represent highly controlled "production" or "production like" environments that are treated as if they contain sensitive information.

CI represents the build and unit test environment.  It really isn't a true environment at all and is solely used for code base health and white box testing.  

Some companies have a Mock Integration environment that can be used to run basic automation, integration or smoke tests.  Partner systems are mocked so that this environment is not at the mercy of partner deployments and stability.  Mock Integration environments tend to get built after a team has had trouble running its nightly automation because of some other team's environment stability issues.  Dev Integration represents the first environment where your system integrates with partner systems.  Tests here should run nightly or several times a week; basic automation, functional or smoke tests run in this environment.  Stability depends on the integration points.  Some companies have a partner integration environment that represents the next release but is more stable than Dev Integration.

QA (yours) is the environment used by true testers doing either manual tests or using various types of automation. This environment is owned by "your" QA team. It integrates with QA environments provided by other teams. Builds here usually happen several times a week.

User Acceptance Test environments are the place where end users, product owners and other interested parties run their tests. Deployments in this environment happen on demand from the users.  Agile doesn't really talk about or support this within the sprint concept.  This is normally the environment used to get "external sign-off".

Prod is production.  You knew that.  Prod Mirror usually mirrors the current code in production.  It is used for production triage where testers and developers are, rightfully, not allowed access to the production system. This environment is sometimes given names like Production Triage.
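
Purely as an illustration, the pipeline above can be written down as data that deployment tooling consumes; the stage names and attributes below are just the ones described in this post, not any particular tool's configuration:

# Hypothetical description of the future-code pipeline as plain data.
# Stage names, colors and cadences come from the descriptions above.
PIPELINE = [
    {"env": "CI",               "color": "white",  "deploys": "every commit",       "integrates_with": "nothing (build/unit tests)"},
    {"env": "Mock Integration", "color": "white",  "deploys": "nightly",            "integrates_with": "mocked partners"},
    {"env": "Dev Integration",  "color": "white",  "deploys": "nightly",            "integrates_with": "partner dev environments"},
    {"env": "QA",               "color": "yellow", "deploys": "several times/week", "integrates_with": "partner QA environments"},
    {"env": "UAT",              "color": "yellow", "deploys": "on demand",          "integrates_with": "end users / product owners"},
    {"env": "Prod Mirror",      "color": "red",    "deploys": "with each release",  "integrates_with": "production triage"},
    {"env": "Prod",             "color": "red",    "deploys": "with each release",  "integrates_with": "everything"},
]

for stage in PIPELINE:
    print(stage["env"], "|", stage["color"], "|", stage["deploys"])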


Partner on Production Path. 

Partner teams need a place to test their future code against your current release.  They use this so that they can release without being locked into your release cycle.  How many of these environments you need really depends on integration complexity and how different the release cycles are.  Note that these environments use the "current" data model and schema while the new development environments use a "future" data model and schema.



Partner Integration is your integration environment that is consumed/used by the partner integration teams. This is where they run functional or integration tests.

Partner QA is the environment foreign QA test teams integrate with.  Their Internal QA environment communicates with your Partner QA environment.  Some teams will forgo this and use the Prod Mirror environment mentioned above.

Tradeoffs

The number of environments can clearly be reduced by software teams that don't think they need them all.  It can also be expanded to include branches, security testing or other needs.  The decision is a tradeoff between operational complexity and conflicting needs with respect to the data available, environmental SLAs and the version of the code to be deployed.

My last three major enterprise customers all had between 5 and 7 environments.  This doubled if the same application was deployed in different data centers like cloud and on-prem.  

Highly automated companies, running in the cloud, have the added option of tearing down environments while they are not being used.   Good examples might be various QA environments or Performance testing environments.

Sunday, September 4, 2016

The Cloud is an Opportunity

"Excellence is a continuous process and not an accident"
A hybrid cloud is just an offsite data center if you migrate your applications and processes as is.

This Topic on Video

Cloud as an Opportunity

Life in the Cloud Should Be Different 

  • Opportunity to bake in policies and practices
  • Full automation is possible and required to feed continuous processes
  • Continual building and destruction of infrastructure is desirable over stale configurations
  • Dynamic and on-demand capacity is available and should be leveraged
  • It is easy to isolate teams, applications and partner firms using built in tools.
  • Resiliency must be part of the design and not an afterthought.
A cloud migration is an opportunity to bake in policies and practices that were impossible in your previous environments.   It is an opportunity to leverage cloud vendor provided security, automation and pre-built services in a way that increases your team's capabilities.  The cloud lets you automate your infrastructure so that you are no longer fearful of making changes or rebuilding networking, servers, load balancers or data stores.  Cloud teams regularly destroy and rebuild infrastructure, guaranteeing that you know how it all goes together.  Cloud subscriptions/accounts and networks let you manage and isolate applications, teams and third party applications in ways that networks alone didn't in the past. Cloud environments move away from fixed deployments with fixed addresses. They move towards dynamic deployments and address mobility, and they require that applications be more resilient from day one.  This lets applications handle the more chaotic nature of cloud environments and makes them more robust in the face of system or other failures.

Hybrid Cloud is Not an Off Site Data Center

Bake Security into Everything You Do

  • In Transit
  • At Rest
  • Application Authentication
  • Application Authorization
  • Credential Management
  • Operational Roles
In-house data centers are often not very secure.  They tend to move data across the wire in the clear and leave data unencrypted in databases and on disks. Cloud migrations often start with a certain level of paranoia.  Internal services often have weak internal call verification and tend to trust other services in the same containers or networks.

Cloud migrations tend to force a reevaluation based on the notion that data is no longer inside the company.  A cloud migration is the perfect time to secure data in transit and to secure data at rest.  A fair amount of effort is required to understand the at-rest requirements. Data storage products may offer their own encryption for all or portions of the data.  Some companies only protect sensitive data.  Others decide that it is simpler to have a single level of security for each device type: RDBMS, NoSQL, file, blob...
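
As one small example of baking it in rather than retrofitting, EBS encryption at rest can be requested when a volume is created.  A sketch using boto3, the AWS SDK for Python (the region, availability zone and size are made-up values for the example):

# Sketch: ask for encryption at rest at creation time instead of migrating later.
# Region, AZ and size are example values only.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=240,              # GiB
    VolumeType="gp2",
    Encrypted=True,        # data at rest, snapshots, and traffic to the instance are encrypted
)
print(volume["VolumeId"])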

<this portion to be updated later>

Automate Everything

  • Build and Test
  • Infrastructure and Network
  • Monitoring
  • Recovery
  • Data Handling
  • Price and performance selection

Services Catalog

Data Services Catalog

  • SQL Data
  • NoSQL Data
  • Large Data Storage
  • Search
  • Messaging

Non-Data Services Catalog

  • Provisioning
  • Monitoring
  • Computation
  • Scaling
  • Network, Firewalls, Routers
  • Zero Provisioning Application Platforms

Cloud Accountability

  • Costs are Explicit
  • Resource Consumers are Exposed
  • Teams and projects pick their own cost models.
  • Data is visible in common consoles

Cloud Risks

  • It feels like a shiny object
  • Staff must be multi-functional
  • Encryption keys and certificate management are critical
  • Network edges must be protected
  • Public Services Security Risks must be understood
  • Some will not believe automation and change are possible or desirable