
Integrating backpressure into the infrastructure

Continuing our exploration of the Fastify plugin under-pressure, we deploy our backpressure mechanism to Kubernetes.


We continue to explore the benefits of the Fastify plugin under-pressure. Previously we used a custom Prometheus metric to build a simple backpressure mechanism in a Fastify application; now we look at integrating our backpressure mechanism into our infrastructure.

The sample code for both posts is available at nearform/backpressure-example. The code for this part is in the part-2 branch. Check out the sample code for this part:

Plain Text
git clone https://github.com/nearform/backpressure-example.git
cd backpressure-example
git checkout part-2


Our infrastructure will consist of a Kubernetes workload deployed via Helm. It also requires Docker to create the image of the application that we’ll deploy to the cluster. If you don’t have a Kubernetes cluster available, a simple way to run a cluster in your local environment is to use Docker Desktop, which includes Kubernetes.

You can follow each individual tool’s setup instructions on its website.

Once you’re set up, the following CLI programs should be available in your terminal:

  • docker
  • kubectl
  • helm

If you prefer to follow along without installing the tools, simply read on and look at the accompanying source code.

Kubernetes liveness and readiness probes

In the first part of the article we decided that we would open the circuit when the response times of our application’s /slow endpoint exceeded 4 times the expected response time of 200ms. When this happened, we returned a 503 Service Unavailable HTTP error via under-pressure.

This is a safety mechanism to prevent the application from being overwhelmed with requests, and not something that should happen when our application runs in production.

Instead, we want to make sure that our infrastructure stops serving requests to the application before we reach that point. To do this, we’ll use Kubernetes probes.

We’ll change the application code so that it exposes two additional endpoints, named /liveness and /readiness.

The /liveness endpoint is the simplest one: given how Kubernetes expects it to work, in our case it should always return a successful response.

The /readiness endpoint is more interesting because, based on its response, Kubernetes decides whether to serve requests to the Pod or not.

Earlier, we configured our safety mechanism to stop accepting requests at a threshold 4 times above the expected response times. Intuitively, we want to configure the readiness probe at a lower threshold — for example, twice the expected response time.

To do so, we change our application’s slow.js module as follows:

Plain Text
function canAcceptMoreRequests() {
  // twice the expected duration, converted to seconds
  return metric.get999Percentile() <= (REQUEST_DURATION_MS * 2) / 1e3
}

We also encapsulate our custom Prometheus metric in its own module, which now reads as:

Plain Text
const prometheus = require('prom-client')

const metric = new prometheus.Summary({
  name: 'http_request_duration_seconds',
  help: 'request duration summary in seconds',
  maxAgeSeconds: 60,
  ageBuckets: 5,
})

// index 6 is the .999th percentile in the Summary's default percentile list
metric.get999Percentile = () => {
  return metric.get().values[6].value
}

module.exports = metric
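The hard-coded values[6] works because prom-client’s Summary reports, by default, the percentiles [0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999] in order, so the .999th percentile is the seventh entry. A quick sanity check of that assumption:

```javascript
// prom-client's default Summary percentiles, in the order they appear
// in metric.get().values (assumption: the defaults are unchanged)
const defaultPercentiles = [0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999]

// index of the .999th percentile, i.e. the 6 in values[6]
const p999Index = defaultPercentiles.indexOf(0.999)
console.log(p999Index) // → 6
```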

Then, in the root of our application we create the two endpoints that will be used by Kubernetes probes:

Plain Text
fastify.get('/liveness', async () => {
  return 'OK'
})

fastify.get('/readiness', async () => {
  if (slow.canAcceptMoreRequests()) {
    return 'OK'
  }

  throw new TooManyRequests('Unable to accept new requests')
})

The readiness probe is configured to respond with an error before the circuit opens, so we are relying on the infrastructure to stop serving requests when the probe delivers such a response.

The circuit breaker is a safety net in case the infrastructure doesn’t respond quickly enough.

The liveness probe is simpler because it will always return a successful response. Our example has no errors from which the application cannot recover. A more realistic implementation of the liveness probe would take into account additional factors, such as a database connection that cannot be established, which would cause the application to be permanently unhealthy. In that case, the liveness endpoint should return an error.
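As a sketch of that idea (the flag and function names here are hypothetical, not part of the sample code), the application could record an unrecoverable failure and let the liveness handler report it:

```javascript
// Hypothetical module-level flag recording an unrecoverable failure,
// e.g. a database connection that could never be (re-)established
let fatalError = null

function reportFatalError(err) {
  fatalError = err
}

// What a /liveness handler could do: return OK while healthy, throw
// (which Fastify would turn into a 5xx response) once the Pod is beyond
// recovery, prompting Kubernetes to restart it
function livenessCheck() {
  if (fatalError) {
    throw new Error(`unrecoverable: ${fatalError.message}`)
  }
  return 'OK'
}
```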

Deploying the application to Kubernetes

The first thing we need to do to run our application in the Kubernetes cluster is create an image of the application using Docker:

Plain Text
docker build -t backpressure-example .

Then we install all the Helm charts needed for our example, which include the application and other services we’ll look at later.

Plain Text
helm install backpressure-example helm/

Finally, we can check the local port on which the application is running by executing:

Plain Text
kubectl get service backpressure-example

The command above gives an output similar to:

Plain Text
NAME                   TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
backpressure-example   NodePort   …            <none>        80:31470/TCP   22h

We can now access the application at http://localhost:31470 (the port will most likely be different on your machine).

Triggering the readiness probe

The configuration for the Kubernetes deployment can be found in the source code repository accompanying this article. The relevant section of the configuration file is:

Plain Text
livenessProbe:
  httpGet:
    path: /liveness
    port: web
  initialDelaySeconds: 3
  periodSeconds: 3
readinessProbe:
  httpGet:
    path: /readiness
    port: web
  initialDelaySeconds: 3
  periodSeconds: 3

This configures the liveness and the readiness probes. We’re now going to trigger the readiness probe by putting the application under load via autocannon as we’ve done in the first part of this article.

If you haven’t used autocannon before, you can install it via npm:

Plain Text
npm install -g autocannon

Before hitting the application, let’s keep an eye on the status of the Kubernetes deployment so we can check when the single Pod we currently have turns from ready to non-ready due to the readiness probe:

Plain Text
kubectl get deployment backpressure-example-deployment -w

This will show an output similar to:

Plain Text
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
backpressure-example-deployment   1/1     1            1           22h

The above output means that 1 Pod is ready out of the 1 Pod deployed, which is what we expect because only one replica is running.

In another terminal window, we can now run autocannon in the usual way, making sure to use the HTTP port the service is bound to on our host machine:

Plain Text
autocannon http://localhost:31470/slow -c 20 -p 20

To make the /readiness endpoint return an error status code, we need to put enough load on the application to make the .999th percentile of the requests last at least 400ms, which is the threshold we configured.

You can check how long requests are taking by hitting the /metrics endpoint in your browser and by changing the autocannon options accordingly.
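If you’d rather script that check than eyeball it, the .999th percentile can be pulled out of the Prometheus text format that /metrics returns. A minimal sketch (assuming the metric name used above and no scraping library):

```javascript
// Extract the .999th quantile from Prometheus text exposition, e.g. a line
// like: http_request_duration_seconds{quantile="0.999"} 0.42
function extractP999(metricsText) {
  const line = metricsText
    .split('\n')
    .find((l) => l.startsWith('http_request_duration_seconds{quantile="0.999"}'))
  // the value is the last whitespace-separated field on the line
  return line ? Number(line.trim().split(/\s+/).pop()) : null
}
```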

When the threshold is reached, Kubernetes will detect that the application is reporting that it’s not ready to receive more requests and will remove the Pod from the load balancer. The output of the earlier kubectl get deployment command will show something like this:

Plain Text
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
backpressure-example-deployment   1/1     1            1           22h
backpressure-example-deployment   0/1     1            0           22h

When the autocannon run completes, the application will reflect the shorter response times in the metrics values, which will cause Kubernetes to detect a successful readiness probe and put the Pod back into the load balancer:

Plain Text
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
backpressure-example-deployment   1/1     1            1           22h
backpressure-example-deployment   0/1     1            0           22h
backpressure-example-deployment   1/1     1            1           22h

Up to this point we’ve achieved the ability to stop overloading the application by means of an internal circuit breaker and via Kubernetes’ readiness probe. The next step is to automatically scale the application based on load.

Exposing custom metrics

To allow Kubernetes to scale our application, we will need to expose custom metrics that can be used by Kubernetes’ Horizontal Pod Autoscaler (HPA).

By default, the autoscaler can use a range of metrics built into Kubernetes, and we could use those metrics for autoscaling. In our example, we want to use a custom metric. Therefore, we need to make sure we expose that metric to Kubernetes and make it available to the autoscaler.

We achieve that by using Prometheus Adapter, which is already running inside our Helm deployment.

The relevant section of the configuration is:

Plain Text
rules:
  default: false
  custom:
    - seriesQuery: 'http_request_duration_seconds'
      resources:
        overrides:
          kubernetes_pod_name: { resource: 'pod' }
          kubernetes_namespace: { resource: 'namespace' }
      metricsQuery: sum(http_request_duration_seconds{quantile="0.999", kubernetes_pod_name=~"backpressure-example-deployment.*"}) by (kubernetes_pod_name)

With this configuration we can then query the metric:

Plain Text
kubectl get --raw "/apis/*/http_request_duration_seconds" | jq .

This will provide an output similar to:

Plain Text
{
  "kind": "MetricValueList",
  "apiVersion": "",
  "metadata": {
    "selfLink": "/apis/"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "backpressure-example-deployment-ff555459f-5g5x7",
        "apiVersion": "/v1"
      },
      "metricName": "http_request_duration_seconds",
      "timestamp": "2021-01-05T13:31:14Z",
      "value": "0",
      "selector": null
    }
  ]
}

The output above shows a value of 0 for http_request_duration_seconds, which is the name of the metric we expose and which maps to the .999th percentile reported by our custom metric.

If you try hitting the /slow endpoint manually or with autocannon, you will see the value of the metric reflect the value reported by the /metrics endpoint. The values will not be in sync because there is a certain delay in the update of the Kubernetes metric due to polling and propagation of the metric from the application to Prometheus and then from Prometheus to Kubernetes.


Enabling the Horizontal Pod Autoscaler

The last step in getting our infrastructure to handle the increasing load on the application properly is to enable automatic scaling via Kubernetes’ Horizontal Pod Autoscaler.

This requires a simple change in our Helm chart, which deploys a resource of type HorizontalPodAutoscaler. We will include an additional chart in our Helm deployment. This is available in the part-2-hpa branch.

Plain Text
git checkout part-2-hpa

The autoscaler needs metrics upon which to carry out its autoscaling logic. In our case, it will be our custom metric:

Plain Text
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: test
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backpressure-example-deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metricName: 'http_request_duration_seconds'
        targetAverageValue: 300m

We have configured a minimum of 1 and a maximum of 4 replicas for the Pods running our application, and a target value of 300m for the custom metric exposed to Kubernetes via the Prometheus Adapter. In Kubernetes quantity notation, 300m means 0.3; since the metric is expressed in seconds, that corresponds to 300ms.
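The target’s m suffix is Kubernetes’ milli-unit quantity notation. A tiny helper (hypothetical, just for intuition) makes the conversion concrete:

```javascript
// Convert a Kubernetes metric quantity string to a plain number:
// "300m" (milli-units) → 0.3, "2" → 2, "0" → 0
function parseQuantity(quantity) {
  return quantity.endsWith('m')
    ? Number(quantity.slice(0, -1)) / 1000
    : Number(quantity)
}
```

With the metric expressed in seconds, parseQuantity('300m') gives 0.3, i.e. 300ms.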

We can test the behaviour of the autoscaler by upgrading our deployment with:

Plain Text
helm upgrade backpressure-example helm/

We can now run autocannon against the application and, by watching the value of the /metrics endpoint, increase the load so that the response times go above 300ms.

Plain Text
autocannon http://localhost:31470/slow -c 20 -p 20

If we keep an eye on the deployment...

Plain Text
kubectl get deployment backpressure-example-deployment -w

...we will see that when the metric value exceeds the threshold, the autoscaler will increase the number of Pods:

Plain Text
NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
backpressure-example-deployment   1/2     2            1           26h
backpressure-example-deployment   2/2     2            2           26h

To confirm this, we can look at the output of:

Plain Text
kubectl describe hpa

This will tell us the reason why the autoscaler increased the number of replicas:

Plain Text
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    SucceededRescale    the HPA controller was able to update the target scale to 2
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request_duration_seconds
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  55s   horizontal-pod-autoscaler  New size: 2; reason: pods metric http_request_duration_seconds above target

Putting it all together

Here is a summary of how our application will behave using the circuit breaker, the readiness probe and the autoscaler:

  • When the average value across Pods of the .999th percentile of the response time is above 300ms, the autoscaler will increase the replicas up to a maximum of 4.
  • When the .999th percentile of the response times of each single Pod is above 400ms, the Pod will fail the readiness probe and will be taken out of the load balancer by Kubernetes. It will be added back to the load balancer when the response times decrease below the threshold.
  • When the .999th percentile of the response times of each single Pod is above 800ms, the application’s circuit breaker will open as a safety mechanism, and the application will reject further requests until the circuit is closed. This happens when the response times fall below the threshold and is handled by under-pressure.

Though seemingly arbitrary, the threshold values are chosen so that:

  • The autoscaler kicks in first (300ms threshold).
  • If for any reason a Pod keeps receiving more requests than it can handle, it will fail the readiness probe, causing Kubernetes to stop serving it requests (400ms) in order to preserve the responsiveness of the Pod for the outstanding requests.
  • If for any reason a Pod keeps being served requests despite failing the readiness probe, it will trigger the circuit breaker which will cause further requests to be rejected (800ms).
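Expressed in code, the three tiers are all multiples of the same expected duration (a sketch for intuition; the 300m autoscaler target actually lives in the Helm chart, not in the application):

```javascript
const REQUEST_DURATION_MS = 200 // expected /slow response time

// 1st line of defence: the HPA adds replicas above this average (set as 300m)
const AUTOSCALE_TARGET_MS = 300

// 2nd: the Pod fails its readiness probe and leaves the load balancer
const READINESS_LIMIT_MS = REQUEST_DURATION_MS * 2 // 400ms

// Last resort: under-pressure opens the circuit and rejects requests
const CIRCUIT_BREAKER_MS = REQUEST_DURATION_MS * 4 // 800ms
```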

In this pair of articles, we’ve outlined how to create a complex mechanism capable of preserving the performance and health of a Node.js Fastify application deployed to Kubernetes.

The mechanism consisted of an in-application circuit breaker implemented via under-pressure, a readiness probe handled by Kubernetes and an autoscaling algorithm handled by Kubernetes HPA.

We used a custom metric calculated and exposed via Prometheus to define whether the application was healthy and responsive.

This allowed us to scale our application automatically when response times increased, preserve the performance of the application by temporarily excluding it from the load balancer when response times were higher than normal, and stop responding to requests when doing so would compromise the health of the application.
