What do we want out of a load or performance test?

We use performance tests to verify the raw throughput of a subsystem and to verify the overall impact that subsystem has on the entire ecosystem. Load tests act as documentation for performance indicators and reinforce performance expectations. They are vital in identifying performance regressions.

Load and performance tests are an often overlooked part of the software release lifecycle. Load tests, at their most basic level, are about stress testing a system by dropping a lot of work onto it. Sometimes it is a percentage of expected load, other times it is the expected load, and other times it is future expected levels of load. A failure to test expected near-term load can lead to spectacular public failures.

Measurements

Your business requirements determine your requirements for throughput and latency.
Your financial requirements impact the choices you take towards achieving those goals.
Your technical implementation determines whether you improve efficiency or scale up/out to meet those requirements.

Two simple metrics determine whether your system will meet its operational needs under load. Operations Per Second (OPS) is the total amount of traffic that can be pushed through a system. Latency is how long a requestor has to wait for their operation to complete. They represent the what.

A different set of metrics determines what needs to be done to improve the system to meet your needs. We measure Resource Utilization and Execution Profiles to understand why a system behaves the way it does.

Determine your OPS and Latency needs. Test to see if you meet those needs. Tune via efficiency improvements or via resource scaling.
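As a rough illustration of the "test to see if you meet those needs" step, here is a minimal Python sketch that measures both numbers for a single operation. The names measure and sample_operation are made up for this example, and sample_operation is only a stand-in for a real request. A real load test would also drive the system from many clients at once.

```python
import time
from statistics import mean

def measure(operation, count):
    """Run `operation` `count` times; return (ops_per_second, latencies_in_seconds)."""
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return count / elapsed, latencies

def sample_operation():
    # Hypothetical stand-in for a real request to the system under test.
    sum(range(1000))

ops, latencies = measure(sample_operation, 200)
print(f"{ops:.0f} ops/sec, mean latency {mean(latencies) * 1000:.3f} ms")
```

The OPS number answers "how much" and the latency list answers "how long". Keeping the individual latencies, rather than only the mean, lets you look at the slow outliers later.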

Latency

This is the processing time of an individual request. You may desire low latency when a human is waiting or when something will blow up if it doesn't get an answer in time. You may not care very much about the actual latency if the response time is relatively unimportant.

Some systems prioritize work and provide different latencies for different workloads. Network routers may prioritize video traffic over web traffic because web traffic can tolerate higher latency and more variability than a video feed can.
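A toy model of prioritization, assuming a single worker, a fixed service time, and every job arriving at the same moment. The job names and priority numbers are invented for the example, and this is not how a real router schedules packets.

```python
import heapq

def completion_times(jobs, service_time=1.0):
    """jobs is a list of (priority, name); lower priority numbers are served first.

    All jobs are assumed to arrive at time 0 and be handled by one worker.
    Returns a dict mapping each job name to the time it finished.
    """
    # The arrival order breaks ties so equal-priority jobs stay first-in, first-out.
    queue = [(priority, order, name) for order, (priority, name) in enumerate(jobs)]
    heapq.heapify(queue)
    clock = 0.0
    finished = {}
    while queue:
        _, _, name = heapq.heappop(queue)
        clock += service_time
        finished[name] = clock
    return finished

jobs = [(1, "web-1"), (0, "video-1"), (1, "web-2"), (0, "video-2")]
print(completion_times(jobs))
```

The video jobs finish first even though a web job arrived ahead of them. Total throughput is unchanged; only who waits has changed.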

Latency = execution time of a single operation

or another way

Wait-time ∝ Latency / Resource Availability

Improving resource utilization can improve latency. You have to decide if the improved efficiency is worth the spend it takes to optimize for latency.

A rocket engine control loop needs low latency. A lawn sprinkler system still works if it takes 30 seconds to fully engage all the sprinkler heads.

Operations Per Second

OPS is the raw throughput of your system. It represents the number of requests that your system can process in some time period. It does not describe how long any specific request takes to complete. Throughput and latency can be related to each other by

Throughput ∝ parallelism / latency

or another way

Throughput ∝ efficiency * parallelism

A system can have high throughput and high latency. The local DMV office improves the throughput of its waiting line by opening additional windows. Each individual transaction still takes just as long.
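The DMV example can be put into numbers. This sketch assumes each window serves one customer at a time and that every transaction takes the same five minutes. Both figures are invented for illustration.

```python
def customers_per_hour(windows, seconds_per_customer):
    """Steady-state throughput when each window handles one customer at a time."""
    return windows * 3600 / seconds_per_customer

# Hypothetical numbers: every transaction takes 300 seconds.
one_window = customers_per_hour(1, 300)
four_windows = customers_per_hour(4, 300)
print(one_window, four_windows)
```

Going from one window to four quadruples the throughput while the five-minute latency of each transaction stays exactly the same.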

Resource Utilization

Resource utilization is measured as part of load testing. These measurements are used to understand resource constraints and where throttling occurs.

Optimizing resource utilization can lower latency and costs. It is the alternative to scaling up/out to achieve performance goals.

You can build a system that meets your OPS and latency needs and not be able to afford to run it. Load tests help you measure your resource utilization under load. This can provide hints on where to optimize or tune to improve performance or lower cost.
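One cheap hint is the ratio of CPU time to wall-clock time while a workload runs. The sketch below uses only the Python standard library and measures the current process only. The functions cpu_bound and io_bound are hypothetical stand-ins, and a real load test would normally collect utilization from the hosts or an external monitoring tool instead.

```python
import time

def cpu_utilization(workload):
    """Return the fraction of wall-clock time this process spent on the CPU."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    workload()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0

def cpu_bound():
    # Hypothetical workload that keeps the CPU busy.
    sum(i * i for i in range(200_000))

def io_bound():
    # Hypothetical workload that mostly waits, like a slow database call.
    time.sleep(0.05)

print(f"cpu-bound: {cpu_utilization(cpu_bound):.0%}")
print(f"io-bound:  {cpu_utilization(io_bound):.0%}")
```

A workload near 100% is a candidate for code optimization or more CPU. A workload near 0% is waiting on something else, so adding CPU will not help.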

Simple & Repeatable or Complicated & Accurate

One is the easiest number

Stand-alone load tests involve as few components as possible. A web service stand-alone test may involve just a web service and a database. An ingestion test may involve just the ingestion feeder and the components on the path to the database.

Stand-alone tests can be used to predict whether performance will improve or get worse after a change. They are a relative measure of performance change and not absolute metrics.

Stand-alone load tests that involve the minimum number of inputs and components are the simplest and most repeatable. They reduce the number of variables and simplify execution. Stand-alone load tests provide great regression test platforms.
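A regression check of this kind only needs a saved baseline and a tolerance. This is a minimal sketch. The metric names, the numbers, and the 10% tolerance are all assumptions made up for the example.

```python
# Hypothetical metric names. Latency is the only one here where lower is better.
LOWER_IS_BETTER = {"p95_latency_ms"}

def find_regressions(baseline, current, tolerance=0.10):
    """Return the metrics that got worse than the baseline by more than `tolerance`."""
    regressed = []
    for name, old in baseline.items():
        change = (current[name] - old) / old
        if name in LOWER_IS_BETTER:
            change = -change  # a rise in latency is a drop in performance
        if change < -tolerance:
            regressed.append(name)
    return regressed

baseline = {"ops_per_sec": 1200.0, "p95_latency_ms": 40.0}
current = {"ops_per_sec": 1150.0, "p95_latency_ms": 52.0}
print(find_regressions(baseline, current))
```

Here throughput dropped about 4%, which is inside the tolerance, while latency rose 30% and gets flagged. The comparison is relative, so it works even though the stand-alone numbers say little about production.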

They are lousy proxies for absolute system performance. Your pieces share resource access in the real system. Isolated load tests do not take into account the side effects of other components.

You can optimize your component against a stand-alone load test until it reaches the highest performance possible and consumes all available shared resources. In the real system this can make your component effectively mount a denial-of-service attack on the other pieces that use the same shared components, like the database.

Everyone gets involved

Full end-to-end integrated load tests try to exercise all ingestion and extraction processes in a way that mimics production. These tests provide metrics that are much more likely to align with what you will see in production. They can be done by running simultaneous load tests in the related systems or by creating simulated traffic for some of the pieces to replicate the final load without having to coordinate across a bunch of other systems.
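Running simultaneous load from one driver might look like the sketch below. The ingest and extract functions are hypothetical stand-ins that just sleep. In a real test they would call the actual ingestion and extraction paths, and the load would likely come from more than one thread per workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(name, operation, duration_seconds):
    """Call `operation` repeatedly for `duration_seconds`; return (name, completed count)."""
    deadline = time.perf_counter() + duration_seconds
    completed = 0
    while time.perf_counter() < deadline:
        operation()
        completed += 1
    return name, completed

def ingest():
    time.sleep(0.002)  # hypothetical stand-in for writing a record

def extract():
    time.sleep(0.005)  # hypothetical stand-in for running a query

workloads = {"ingest": ingest, "extract": extract}
with ThreadPoolExecutor(max_workers=len(workloads)) as pool:
    futures = [pool.submit(run_load, name, op, 0.2) for name, op in workloads.items()]
    results = dict(f.result() for f in futures)
print(results)
```

Because both workloads run at the same time, any contention for shared resources shows up in the counts, which is exactly what a stand-alone test hides.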

Generating production-sized data can be hard. One approach is to tap production traffic and sanitize and anonymize that data before using it to drive a broad load test. This approach helps ensure that the shape of the test data matches the shape and distribution of the production data.
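One way to anonymize without changing the shape of the data is to replace sensitive values with keyed hashes, so the same input always maps to the same token. The sketch below assumes records are flat dictionaries. The field names and the key are invented for the example, and a real sanitization step needs a proper privacy review.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"   # hypothetical key, keep it out of source control
SENSITIVE_FIELDS = {"user_id", "email"}      # hypothetical field names

def anonymize(record):
    """Return a copy of `record` with sensitive values replaced by keyed hashes."""
    clean = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            clean[field] = digest.hexdigest()[:16]
        else:
            clean[field] = value
    return clean

print(anonymize({"user_id": 42, "email": "a@example.com", "bytes": 1024}))
```

A user who generated 5% of production traffic still generates 5% of the test traffic, just under a token instead of a real identifier.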

End

Running performance tests with no notion of KPIs can lead to frustration and is a waste of resources. Push to understand the performance and resource utilization requirements so that you can structure your tests correctly.
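Once the KPIs are written down, checking a test run against them is mechanical. This is a small sketch with made-up thresholds. It assumes the KPIs are a minimum OPS and a maximum 95th-percentile latency, which may not match the KPIs your business actually cares about.

```python
from statistics import quantiles

# Hypothetical KPI thresholds for illustration only.
KPIS = {"min_ops_per_sec": 500.0, "max_p95_latency_ms": 50.0}

def check_kpis(ops_per_sec, latencies_ms, kpis=KPIS):
    """Return a list of human-readable KPI failures; an empty list means the run passed."""
    p95 = quantiles(latencies_ms, n=100)[94]  # 95th percentile cut point
    failures = []
    if ops_per_sec < kpis["min_ops_per_sec"]:
        failures.append(f"throughput {ops_per_sec:.0f} ops/sec is below the target")
    if p95 > kpis["max_p95_latency_ms"]:
        failures.append(f"p95 latency {p95:.1f} ms is above the target")
    return failures

print(check_kpis(600.0, [10.0] * 100))
print(check_kpis(400.0, [10.0] * 90 + [100.0] * 10))
```

The first run passes. The second fails on both counts, even though 90% of its requests were fast.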

This article has probably already gone on too long, so I'll end here for now. :-)

Created 2021/04

Joe Freeman's Blog