The answer is very simple: Player A. You just need to teach “A” to kick to the left.
💡 Reducing variability is essential to get reliable results.
Therefore, it is strongly recommended to:
- Use a dedicated server to perform benchmarks. In AWS, these are called Dedicated Instances: Amazon EC2 instances that run in a VPC on hardware dedicated to a single customer.
- Reduce any turbulence that might affect your benchmark results. For instance, use htop to spot background processes and then close them.
- Try to run the benchmark as closely as possible to the production environment. This recommendation is important for performance-driven applications such as stock market software.
As stated earlier, reducing noise is hard work, but it must be done if you want reliable results.
Isolate your Microbenchmarks
Look at the following snippet:
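A minimal sketch of such a microbenchmark (function names and the iteration count are my own, not the original snippet):

```javascript
// Naive microbenchmark comparing an arrow function with a regular function,
// both running in the same process: exactly the setup the text warns about.
const arrow = (a, b) => a + b;
function regular(a, b) { return a + b; }

function bench(name, fn, iterations = 1e7) {
  const start = process.hrtime.bigint();
  let acc = 0;
  // Accumulate the result so V8 cannot eliminate the calls as dead code.
  for (let i = 0; i < iterations; i++) acc += fn(i, i + 1);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${name}: ${ms.toFixed(2)} ms`);
  return ms;
}

bench('arrow', arrow);
bench('regular', regular);
```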
Running it will provide an interesting result:
It seems pretty conclusive: the arrow function is almost 3x faster than a regular function.
However, when changing the order of the calls, you will get another interesting result:
Now, the ‘regular function’ is the fastest one. The reason is that the V8 engine optimizes and de-optimizes function calls, and both benchmarks run in the same environment.
Therefore, be sure to reset the V8 state on each run.
Metrics are hard and their evaluation is a crucial part of a benchmark report. Metrics can be confusing, complicated, unreliable, inaccurate, and even plain wrong (due to bugs).
Usually, when making performance tweaks to an application, a common workflow is:
- Run a benchmark before the change
- Run a benchmark after the change
- Compare the first run against the second run
Let’s assume you are measuring execution time, and the first run took 45 seconds to complete, then, the second run, after the change, took 42 seconds to complete. Therefore, you assume your changes improved the execution time by ~6%.
Hence, you create a Pull Request with the changes (B), and someone from your team performs the same workflow (benchmark before and after the change, compare the results, and evaluate). This time, however, the execution times are 46 and 45 seconds respectively, reducing your awesome performance improvement to ~2%.
Even after reducing variability, some benchmarks simply vary. Therefore, you may ask:
- “How many times should I run a benchmark?”
The answer depends on the variance interval. The paper Rigorous Benchmarking in Reasonable Time¹ is an excellent resource on this topic; it shows how to establish the repetition count necessary for any evaluation to be reliable.
Student’s t-test is a statistical method used to test the null hypothesis (H0) when comparing means between groups. Running a t-test helps you understand whether the differences are statistically significant. However, if performance improvements are large (2x or more, for example), there is no need for statistical machinery to prove they are real. A practical example of this method in an application is the Node.js core benchmarking suite.
While computing a confidence interval, the number of samples n (benchmark executions) falls into one of two groups:
- n is large (usually ≥ 30).
- n is small (usually < 30).
This article approaches the first group (n ≥ 30); both groups are covered in detail in the paper Statistically Rigorous Java Performance Evaluation², section 3.2. The ttest module abstracts the confidence calculation. In case you are interested in the equation, see the paper mentioned previously³.
The following snippet is a collection of benchmark results before the change (A) and after the change (B):
⚠️ The Student’s t-test approach relies on the mean of each group. When dealing with HTTP benchmarks, outliers can happen, making the mean misleading, so be careful with it. Always plot your data on a graph so you can understand its behaviour.
The ttest module can be used to calculate the statistical significance of the difference:
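As a sketch of what such a comparison computes, here is a plain-JavaScript Welch’s t-test, the kind of calculation the ttest module abstracts. The sample data below is invented purely for illustration:

```javascript
// Sketch: Welch's t-test in plain JavaScript.
const mean = (xs) => xs.reduce((s, x) => s + x, 0) / xs.length;
const variance = (xs) => {
  const m = mean(xs);
  // Sample variance (n - 1 denominator).
  return xs.reduce((s, x) => s + (x - m) ** 2, 0) / (xs.length - 1);
};

function welchT(a, b) {
  // Standard error of the difference between the two means.
  const se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return (mean(a) - mean(b)) / se; // the t statistic
}

// Hypothetical execution times (seconds): before (A) and after (B) the change.
const A = [45.0, 45.2, 44.9, 45.1, 45.3, 44.8, 45.0, 45.2, 45.1, 44.9];
const B = [42.1, 42.0, 42.3, 41.9, 42.2, 42.1, 42.0, 42.2, 41.8, 42.1];

console.log('t =', welchT(A, B).toFixed(2));
```

A large t statistic (compared against the t distribution for the given degrees of freedom) yields a small p-value, i.e. a statistically significant difference.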
This analysis enables one to determine whether differences observed in measurements are due to random fluctuations or to actual differences between the alternatives being compared. Typically, a 5% threshold is used to identify actual differences.
As a probabilistic test, the current result allows you to say: “I am 95% sure my optimization makes a difference”.
Do not forget: benchmark insights come from the difference between branches, not from raw values. Even when using a probabilistic test, it is extremely important to know your data, and plotting it on a graph is always helpful.
Be Realistic in your Benchmarks
Sometimes benchmark results are totally accurate, but the way they are shared is misleading; this often happens with micro-benchmarks. For example, I maintain a repository called nodejs-bench-operations that measures simple Node.js operations across different Node.js versions and, eventually, helps developers choose a faster solution.
| Operation | ops/sec |
| --- | --- |
| Using parseInt(x, 10) – small number (length 2) | |
| Using parseInt(x, 10) – big number (length 10) | |
| Using + – small number (length 2) | |
| Using + – big number (length 10) | |
The unit used is operations per second (ops/sec). Looking at the table it’s fair to say that “Using +” is at least 4x faster than “Using parseInt(x, 10)”.
However, you have to take this with a grain of salt. Technically, it is indeed 4x faster, but in a production application it can amount to very little improvement in the end, and sometimes the trade-off of adopting the faster approach might not be worth it.
For example, the same operation measured using execution time as the metric will show:
Therefore, for conventional software, one needs to consider whether 0.00641608238ms of improvement per call is worth it.
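As a sketch, a per-call measurement of this kind could produce such a number (the helper name and iteration count are my own, and the exact figure will vary by machine):

```javascript
// Estimate the average per-call execution time (in ms) of a function.
function perCallMs(fn, iterations = 1e6) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn('1234567890');
  return Number(process.hrtime.bigint() - start) / 1e6 / iterations;
}

const costParseInt = perCallMs((s) => parseInt(s, 10));
const costPlus = perCallMs((s) => +s);
console.log(`difference per call: ${(costParseInt - costPlus).toFixed(8)} ms`);
```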
Normally any performance improvement is welcome, but in some circumstances the complexity or disadvantages of implementing the faster approach may not be worth it. By the way, that’s not the case with the plus sign over parseInt(x, 10).
Benchmark Results can Tell you More Than Performance Gotchas
From the benchmark results and the behaviour observed while the benchmark is running, it is possible to predict the software’s limitations.
Let’s say you are looking at an existing system currently handling a thousand requests per second. The busiest resources are the 2 CPUs, which average 60% utilization; therefore, with basic math you can estimate a potential limit using the following equations:
CPU% per request = (2 × 60%) / 1000 = 0.12% CPU per request
Max requests/sec = 200% / 0.12% ≈ 1666 req/sec
This is a common estimate for CPU-bound applications; however, it ignores the fact that other resources can reach their limits before the CPU does. With that caveat, approximately 1666 req/sec is the maximum this application can achieve before reaching CPU saturation.
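The arithmetic above can be sketched as follows (variable names are my own):

```javascript
// Sketch of the CPU-headroom estimate from the text.
const cpus = 2;
const utilizationPct = 60;        // average utilization per CPU, in %
const currentReqPerSec = 1000;

// Percentage of one CPU consumed by each request.
const cpuPctPerRequest = (cpus * utilizationPct) / currentReqPerSec; // 0.12

// Total CPU capacity is 200% (2 CPUs), so the theoretical ceiling is:
const maxReqPerSec = (cpus * 100) / cpuPctPerRequest;                // ~1666
console.log(cpuPctPerRequest, Math.floor(maxReqPerSec));
```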
Many benchmarks simulate customer application workloads; these are sometimes called macro-benchmarks. They may be based on workload characterization of the production environment, and they might be either stateless or stateful (where each request depends on the client state).
As described in the Preparing the Environment section, simulating the production workload provides the essential information you need in order to improve it. Hence, make sure your benchmarks are realistic.
As important as benchmarking correctly is evaluating the benchmark results; understanding the data is critical for decision-making. Plotting the results on a graph is a great way to visualize outliers. The mean can hide issues, so it is not recommended to rely on a single metric.
Many thanks to those who reviewed this long article:
and, obviously, to Nearform for sponsoring me to perform these studies.