Performance: reaching ludicrous speed
By Matteo Collina

This post is a write-up of my talk at this year’s Node.js Interactive conference – see the video here.

Being fast is a big deal. It is not just a race between spec implementers; speed enables a platform to be viable in certain environments.

Back in the early days of Node.js, I believed that Node.js could be faster, and I went down the rabbit hole of performance optimization. At the very first NodeConf EU in 2013, I presented MQTT.js and Mosca, the Node.js client and broker modules for the MQTT protocol.

I claimed Mosca was fast because it could process approximately 20,000 messages per second.

How much faster could I make a Node.js application? I set myself the goal of achieving a tenfold performance boost. In other words, I wanted my broker to ingest approximately 200,000 messages per second. I achieved that goal by doing a complete rewrite (see below) based on the new core Aedes and the new MQTT.js v2 (still unreleased). For the sake of comparison, using the same benchmark, Mosquitto can ingest approximately 130,000 messages per second.

Figure 1. The Node.js event loop

In order to help you understand how to make a Node.js application go at ludicrous speed, allow me a brief digression on how Node.js works. Figure 1 shows the Node.js event loop, which runs after the initial JavaScript (JS) is executed when a Node.js application is launched. The figure shows how Node.js can schedule asynchronous I/O using either the kernel API (net or fs, among others) or the thread pool (all of crypto, and a lot of third-party modules). When one of those asynchronous operations completes, an event is scheduled in the event queue, then picked up and executed inside the JS virtual machine. In turn, the JS code might schedule some more asynchronous I/O, and the process continues until there are no more asynchronous operations or events to process. Then Node.js exits.

What does ‘fast’ mean in the context of Node.js? It means being able to do more I/O operations in a given time frame. As only one event is processed at any given time in the JS VM, this usually means making the JS code faster. However, we always need to remember the bigger picture, as we will see in the last few sections below.

I have tried several tools to make Mosca faster: dtrace, mdb, flamegraphs, perf, and others. The only things that gave me useful insight were node’s --trace_opt, --trace_inlining and --trace_deopt switches. These three switches are extremely important for understanding how Node.js ‘sees’ our functions, and whether or how it optimizes them. A function can be optimized by V8 (our JS VM) provided that it obeys certain rules, but only if it lives long enough. Closures do not usually live long enough to be optimized. Inlining happens during the optimization process (see a full description here). De-optimization happens when V8 makes the wrong choice in optimizing a function, or if we do something ‘bad’ with it (passing arguments of different types, for example). IRHydra2 is a nice tool that helps in addressing those issues.
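For example, assuming your application’s entry point is a file called app.js (a hypothetical name), all three switches can be enabled at once:

node --trace_opt --trace_inlining --trace_deopt app.js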

When I ran my application with those command line options, the output was full of the following:

[marking 0x1a308644a581
  <JS Function (SharedFunctionInfo 0xf1f928b34e9)>
  for recompilation, reason: small function,
  ICs with typeinfo: 8/8 (100%), generic ICs: 0/8 (0%)]

This means that an anonymous function was flagged as a possible optimization target, but its address never appears again in our logs: the function is collected by the garbage collector (GC) before it can be optimized. These functions usually have the following signature:

function (err, data) {
  /* whatever is done here
     is not going to be optimized */
}

This is known as the ‘Node callback style’ and is extremely common in Node.js applications (promises have even more problems, as they allocate more closures).
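To make the difference concrete, here is a minimal sketch (not taken from the talk) that contrasts an inline anonymous callback with the same callback hoisted to a named, top-level function. The inline version allocates a new closure on every call; the hoisted one is allocated only once and lives long enough for V8 to optimize it:

'use strict'

var fs = require('fs')

// A new anonymous closure is allocated on every call, and it is
// usually collected before V8 has a chance to optimize it.
function logSizeInline (file) {
  fs.readFile(file, function (err, data) {
    if (err) { throw err }
    console.log(data.length)
  })
}

// The same callback hoisted to a named, top-level function is
// allocated once and lives long enough to be optimized.
function logSizeTopLevel (file) {
  fs.readFile(file, onRead)
}

function onRead (err, data) {
  if (err) { throw err }
  console.log(data.length)
}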

The examples below demonstrate how our abstractions and interaction model affect our performance. We will call setImmediate three times in parallel as a reference (using my fastbench module):

'use strict'

var bench = require('fastbench')
var max = 1000000

var nextDone
var nextCount

function benchSetImmediate (done) {
  nextCount = 3
  nextDone = done
  setImmediate(somethingImmediate)
  setImmediate(somethingImmediate)
  setImmediate(somethingImmediate)
}

function somethingImmediate () {
  nextCount--
  if (nextCount === 0) {
    nextDone()
  }
}

module.exports = benchSetImmediate

if (require.main === module) {
  var run = bench([benchSetImmediate], max)
  run(run)
}

The above example takes 1685 ms for a single run. The naive approach to running a function three times is as follows:

'use strict'

var bench = require('fastbench')
var benchSetImmediate = require('./benchSetImmediate')
var max = 1000000

function parallel (tasks, done) {
  var count = 0
  var length = tasks.length
  tasks.forEach(function (task) {
    task(function (err) {
      if (err) { return done(err) }
      if (++count === length) {
        done()
      }
    })
  })
}

function benchNaive (cb) {
  parallel([immediate, immediate, immediate], cb)
}

function immediate (cb) {
  setImmediate(cb)
}

module.exports = benchNaive

if (require.main === module) {
  var run = bench([benchNaive, benchSetImmediate], max)
  run(run)
}

This naive example takes approximately 2000 ms (MacBook Pro 2014, i7, 16GB of RAM) to complete a single run, as it allocates a lot of closures:

[marking 0x1585551f7029 <JS Function (SharedFunctionInfo
0x37356cb47dd9)> for recompilation, reason: small function, ICs with
typeinfo: 4/5 (80%), generic ICs: 0/5 (0%)]

We can improve the naive method above by removing some function allocations:

'use strict'

var benchSetImmediate = require('./benchSetImmediate')
var benchNaive = require('./naive')
var bench = require('fastbench')
var max = 1000000

function parallel (tasks, done) {
  var count = 0
  var length = tasks.length

  for (var i = 0; i < length; i++) {
    tasks[i](wrap)
  }

  function wrap (err) {
    if (err) { return done(err) }
    if (++count === length) {
      done()
    }
  }
}

function benchNaiveFaster (cb) {
  parallel([immediate, immediate, immediate], cb)
}

function immediate (cb) {
  setImmediate(cb)
}

module.exports = benchNaiveFaster

if (require.main === module) {
  var run = bench([benchNaive, benchNaiveFaster, benchSetImmediate], max)
  run(run)
}

This example takes roughly 1850 ms on my system, which means that it improves execution speed by 10%. However, this technique removes the ability to pinpoint when a single function completes, so we cannot track results (or do a map operation).

I independently created the following technique to recycle closures and wrapped it in the reusify module:

'use strict'

var benchSetImmediate = require('./benchSetImmediate')
var benchNaive = require('./naive')
var benchNaiveFaster = require('./naive-faster')
var bench = require('fastbench')
var reusify = require('reusify')
var pool = reusify(Holder)
var max = 1000000

function parallel (tasks, done) {
  var count = 0
  var length = tasks.length
  var holder = pool.get()

  holder.count = length
  holder.done = done

  for (var i = 0; i < length; i++) {
    tasks[i](holder.release)
  }
}

function Holder () {
  this.next = null
  this.done = noop
  this.count = 0

  var that = this
  this.release = function (err) {
    var done = that.done

    if (err) {
      done(err)
    }

    if (--that.count === 0) {
      that.done = noop
      done()
      pool.release(that)
    }
  }
}

function noop () {}

function benchWithReusify (cb) {
  parallel([immediate, immediate, immediate], cb)
}

function immediate (cb) {
  setImmediate(cb)
}

module.exports = benchWithReusify

if (require.main === module) {
  var run = bench([
    benchNaive,
    benchNaiveFaster,
    benchWithReusify,
    benchSetImmediate
  ], max)
  run(run)
}

This example shows you how to use reusify to recycle a function: you place it in an object pool. The function is encapsulated inside a Holder instance, which can be optimized greatly by V8 via a hidden class. Note the this.next property. This will be used by reusify to store the holders in a linked list (which is faster than an array). By applying this technique twice (not shown here), we can track the result of each function. This behavior is wrapped into the steed library.

Steed
Steed provides an API that is similar to async, but with an optional first argument that represents the context/this of the given function (named that in the Steed documentation):

'use strict'

var bench = require('fastbench')
var async = require('async')
var steed = require('steed')()
var max = 1000000

function benchSteed (cb) {
  steed.map(new State(cb, 2), [1, 2, 3], multiply, done)
}

function State (cb, factor) {
  this.cb = cb
  this.factor = factor
}

function multiply (arg, cb) {
  setImmediate(cb, null, arg * this.factor)
}

function done (err, results) {
  if (err) { return this.cb(err) }

  var acc = 0
  for (var i = 0; i < results.length; i++) {
    acc += results[i]
  }

  this.cb(null, acc)
}

function benchAsync (cb) {
  var factor = 2
  async.map([1, 2, 3], function work (arg, cb) {
    setImmediate(cb, null, arg * factor)
  }, function (err, results) {
    if (err) { return cb(err) }

    var acc = 0
    for (var i = 0; i < results.length; i++) {
      acc += results[i]
    }

    cb(null, acc)
  })
}

if (require.main === module) {
  var run = bench([
    benchSteed,
    benchAsync
  ], max)

  run(run)
}

The Steed version takes approximately 3200 ms to complete, versus the async one, which takes approximately 6400 ms. In other words, the Steed version is twice as fast as the async one. Steed achieves this by using top-level functions that can be greatly optimized by V8, and by wrapping our own state inside a State object.

As of now, Steed implements the each, eachSeries, map, mapSeries, parallel, series, waterfall, and queue methods. Check out the docs. These methods are provided by small modules. If you want something slightly faster without checks, have a look at the source code.
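As a quick sketch (not from the original talk), assuming steed.waterfall accepts the same task and callback shapes as async.waterfall, a typical call looks like this:

'use strict'

var steed = require('steed')()

// Each task passes its result to the next one through the callback.
function fetchNumber (cb) {
  cb(null, 21)
}

function double (n, cb) {
  cb(null, n * 2)
}

steed.waterfall([fetchNumber, double], function (err, result) {
  if (err) { throw err }
  console.log(result) // 42
})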

We can now derive a first batch of rules for hot code paths:

  • Do not allocate functions.
  • Make sure that V8 can optimize the functions that you allocate.
  • When you are at 100% CPU, everything is hot.

We still need to face our last enemy before we can reach ludicrous speed: Lord GC.

Figure 2. The Node.js event loop with garbage collection

Whenever we allocate a function, an object, an array, or even just a number, we use memory. From time to time, the garbage collector (GC) needs to free up this memory. In Node.js (and V8), the GC operates either asynchronously or in a stop-the-world fashion: the latter blocks all JS execution. We need to be gentle with our allocations and avoid the stop-the-world GC whenever possible, because we cannot process I/O events, or even run our own functions, while it is running. For the sake of this article, this is an oversimplification; read this for an in-depth explanation.
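To get a feel for how often the GC runs and how long it pauses your application, you can launch Node.js with the --trace_gc flag (again, app.js is a hypothetical entry point):

node --trace_gc app.js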

In Mosca, I was allocating a buffer for each packet and copying over all the relevant information for each one. After being written to the stream, those buffers were collected by the GC.

The solution to this problem is to stream the packet generation and avoid allocation entirely. This saves a few resource-intensive memory copying (memcpy) operations, but more importantly, it saves GC time. This approach was slower in Node.js v0.8 and v0.10 because it needed to cross the JS/C++ barrier many times to send the small buffers to the kernel. However, Node.js 0.12 introduced the cork() API, which ‘stops’ the stream from passing data to the underlying layer until uncork() is called.
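The following is a minimal sketch of the cork()/uncork() pattern, not the actual Aedes or MQTT.js code; the header and payload buffers are hypothetical stand-ins for the pieces of a packet:

'use strict'

var net = require('net')

// Hypothetical small buffers standing in for the parts of a packet.
var header = Buffer.from([0x30, 0x0d])
var payload = Buffer.from('hello/world', 'utf8')

var socket = net.connect(1883, 'localhost')

socket.on('connect', function () {
  socket.cork()          // hold the following writes back
  socket.write(header)   // small write: buffered, not flushed
  socket.write(payload)  // small write: buffered, not flushed
  socket.uncork()        // flush everything in a single pass
})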

Finally, dividing the packet generation into small chunks allowed me to memoize the headers. The result is that the headers are allocated only once and never collected, which further increases speed.
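Here is a hypothetical sketch of that memoization, not the actual MQTT.js code: once a header buffer has been generated for a given packet type and flags, the same Buffer instance is reused for every subsequent packet, so it never has to be collected:

'use strict'

// Cache of header buffers, keyed by packet type and flags.
var headerCache = {}

function getHeader (type, flags) {
  var key = type + '/' + flags
  var header = headerCache[key]
  if (!header) {
    // Allocated only once per (type, flags) combination.
    header = Buffer.from([(type << 4) | flags])
    headerCache[key] = header
  }
  return header
}

module.exports = getHeader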

We can then derive a second set of rules for hot code paths:

  • GC time counts, so reduce allocations as much as possible.
  • Slow code that does not allocate might be faster.

At the very end of this blog post, it is worth stating that most applications do not need to go at ludicrous speed. However, most applications go through a phase of performance optimization sooner or later, and the topics covered here can take you a long way:

  1. Always measure. Check out fastbench as a low-level async benchmarking tool.
  2. Choose top-level functions over closures.
  3. Use control-flow libraries that add as little overhead as possible, such as Steed.
  4. Be careful with data allocation.

Using a set of best practices can enable you to reduce latency and cloud costs with very little work. May the Steed be with you.

PS. nearForm offers expert-led training courses on a range of Node.js topics, including the subject of optimizing the performance of your applications. Training can be standard or customized – just ask. We can also help your enterprise to transition to Node.js with our consulting and co-development.

 
