Earlier in 2020, NearForm was asked by the company behind a highly popular online casual game to help identify areas of the game infrastructure to optimise, in order to reduce costs and improve service delivery. What we found was an opportunity to rework the codebase with a specific focus on performance, which would not only achieve the stated bottom-line goals but also enhance the playing experience for the game’s several million players.
Client: An online casual gaming company, whose mobile game is played daily by several million players
Objective: Cut infrastructure costs by improving the performance of their backend service; create a “performance-aware” culture amongst their teams
Solution: A team of NearForm experts, working hand-in-hand with the customer teams, with a hardened methodology and appropriate tools, to analyse the Node.js codebase, find bottlenecks and propose fixes
As is the case with most regular mobile applications, the game relies heavily on an online service exposing REST endpoints. Almost all the gaming elements are controlled from the server: player progression across levels, action handlers, events and so on. The game could not work without it.
To handle the massive number of players, the AWS-hosted service required thousands of EC2 instances, wired to dozens of Redis clusters, which hold player data. Written as a monolithic Express.js application, the service had a low throughput for some endpoints and a very long startup time (around three minutes).
Because of this setup, the sudden load induced by regular and ad-hoc events in the game could only be handled by over-provisioning instances on the platform, meaning the company was paying for a lot of capacity it didn’t always need. In order to absorb a few peaks, the majority of its servers were underused nearly all the time, with less than 40% CPU in use on average.
The game is free to play, so its cost per player has to be as low as possible. NearForm was on a mission not only to improve the service’s raw performance but also to set up a “performance-aware” culture, so the customer would not fall into the same pitfalls in the future.
NearForm was initially engaged for short-term consultancy work on the game’s codebase, which led to the creation of a list of improvement points. Given the complexity of its service, the company quickly understood they needed a deeper involvement from us to get valuable results.
NearForm dedicated a team of three experts over a period of three months to work closely with the company’s project team. After a quick warm-up to set up the performance tool chain and gather business knowledge, our experts rolled out their methodology to the fully supportive project team:
- Simulate realistic load with benchmark tools to obtain performance measures and identify bottlenecks.
- Triage and prioritise the findings with the company team.
- Improve one specific point, re-running the benchmarks to assess results.
- Test and ship the improvement on the production platform.
- Assess the final outcome using production metrics.
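The "improve one point, then re-run the benchmarks" step in the list above can be illustrated with a small comparison helper. This is only a minimal sketch, not the tooling used on the project: the function names `summarise` and `compareRuns`, the 5% threshold and the sample latencies are all hypothetical, and a real comparison would rely on many more samples and a proper statistical test.

```javascript
// Hypothetical helper: summarise latency samples (in milliseconds) from one benchmark run.
function summarise(samples) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted samples.
  const percentile = (p) =>
    sorted[Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)];
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  return { mean, p50: percentile(50), p99: percentile(99) };
}

// Hypothetical helper: compare a baseline run with a candidate run.
// thresholdPct is an assumed cut-off below which a change is treated as noise.
function compareRuns(baseline, candidate, thresholdPct = 5) {
  const before = summarise(baseline);
  const after = summarise(candidate);
  // Relative change of the mean latency, in percent (negative means faster).
  const changePct = ((after.mean - before.mean) * 100) / before.mean;
  let verdict = "inconclusive";
  if (changePct <= -thresholdPct) verdict = "improved";
  else if (changePct >= thresholdPct) verdict = "regressed";
  return { before, after, changePct, verdict };
}

// Made-up latency samples, for illustration only.
const baseline = [100, 110, 120, 130, 140];
const candidate = [80, 88, 96, 104, 112];
const report = compareRuns(baseline, candidate);
console.log(report.verdict, report.changePct.toFixed(1) + "%");
```

Keeping the verdict three-valued is deliberate in this sketch: a change smaller than the threshold is reported as inconclusive rather than as a win, which mirrors the idea of only shipping improvements that the benchmarks can actually confirm.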
In the same way we use Agile methodologies to develop “regular” projects, the team quickly iterated in two-week periods, using a Kanban board to track progress.
Following the customer delivery and QA process, the performance team used a set of tools to collect metrics and identify bottlenecks in a reliable and reproducible way. These tools included:
- Autocannon and Autocannon-compare to simulate heavy load on a web server and collect latency/throughput results,
- Clinic.js Doctor, Flame and Bubbleprof to collect KPIs on CPU/memory usage, CPU-intensive functions and long-running/stalled functions,
- Heap-profiler to sample heap memory and collect allocation timelines.
These tools, combined with tailor-made benchmark scenarios, have become a fundamental asset that the company teams continue using to measure and improve their service performance. They enabled us to quickly spot poorly performing code and anti-patterns, which we also fixed alongside the team, progressively introducing the company’s developers to performant code habits and patterns.
“Seeing is believing. You probably think that this small change doesn’t do much. However, our metrics in production prove the opposite.”
—Lead developer of the customer performance team
The service performance level progressively improved during the project:
- The latency of most endpoints was reduced by 10% to 20%, and their throughput is now much more stable.
- The memory profile used to have a “sawtooth” shape, a hint that too many objects were being allocated too quickly, increasing the frequency of V8’s mark-and-sweep garbage collection. With our optimisations, the profile is now much flatter, reducing the GC interruptions.
- Startup time was cut by half, paving the way for a more reactive auto-scaling policy.
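To make the "sawtooth" point above more concrete, here is a minimal sketch of the kind of allocation-heavy pattern that can contribute to it, next to a lighter alternative. It is an assumed example, not code from the customer's service: the `players` data and both function names are hypothetical.

```javascript
// Hypothetical hot-path function: total score of active players.
// Allocation-heavy version: filter() and map() each build a temporary array
// on every call, which adds short-lived objects for the garbage collector.
function totalScoreChained(players) {
  return players
    .filter((p) => p.active)
    .map((p) => p.score)
    .reduce((sum, score) => sum + score, 0);
}

// Allocation-light version: a single pass with no intermediate arrays.
function totalScoreLoop(players) {
  let sum = 0;
  for (let i = 0; i < players.length; i++) {
    if (players[i].active) sum += players[i].score;
  }
  return sum;
}

// Made-up data, for illustration only.
const players = Array.from({ length: 1000 }, (_, i) => ({
  active: i % 2 === 0,
  score: i,
}));
console.log(totalScoreChained(players) === totalScoreLoop(players));
```

Both versions return the same result; the difference is only in how many temporary objects each call creates. Whether such a rewrite is worth it depends on how hot the code path is, so it should be confirmed with measurements (for example a heap profile, or `process.memoryUsage().heapUsed` sampled under load) rather than assumed.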
At the very end of the project, a massive online event induced a load increase of 30%. The platform absorbed it without batting an eyelid.
Taken together, these results led the gaming company to revise the number of EC2 instances used, limiting over-provisioning and ultimately saving costs (and energy!). We also organised company-wide presentations to introduce the tooling used, demonstrate how to diagnose performance issues and show how to replace inefficient code with performant patterns.
Perhaps the most important and lasting result is how the collaboration between NearForm experts and the project teams (pair programming, code reviews and presentations) let the company’s developers build the confidence they needed to carry on the work without us.