One approach we have seen is to deploy tooling that reports cloud costs, review the outputs regularly (anywhere from monthly to weekly), and implement changes based on those reports.
To implement this correctly, we must rely on something other than manual processes; we cannot hope that people will remember to tag their resources correctly, or that we will be able to predict our utilisation for the next 12 months accurately.
The cloud moves quickly and people simply can't keep up: from start-ups to enterprise estates, things scale and change faster than we can predict or act on. This is why FinOps needs to form part of the delivery process, just as much as security does.
Modern organisations automate their security. CSPs have made this fairly simple by providing tools to assist in doing so, and the same is true for FinOps. Adding automated cost management into software and infrastructure pipelines shifts FinOps left, minimising engineering impact and reducing cognitive load.
Automation also helps you keep up with the cloud itself. For example, if a cheaper EC2 instance type is introduced, you want a tool to tell you about it rather than relying on someone staying up to date with every cloud change.
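To make the "shift left" concrete, here is a minimal sketch of a cost gate that could run as a pipeline step. The threshold, the function name, and the two cost figures are all hypothetical; in practice the estimates would come from a cost-estimation step run against the infrastructure diff.

```python
# Hypothetical CI gate: fail the build when the estimated monthly cost
# delta of an infrastructure change exceeds a fixed threshold.
# The cost inputs are placeholders for values a cost-estimation tool
# would produce from the proposed change.

THRESHOLD_USD = 100.0  # assumed monthly budget for a single change


def cost_gate(current_monthly_usd: float,
              proposed_monthly_usd: float,
              threshold_usd: float = THRESHOLD_USD) -> bool:
    """Return True if the change is within budget, False to fail the build."""
    delta = proposed_monthly_usd - current_monthly_usd
    if delta > threshold_usd:
        print(f"FAIL: change adds ${delta:.2f}/month "
              f"(limit ${threshold_usd:.2f})")
        return False
    print(f"OK: change adds ${delta:.2f}/month")
    return True


if __name__ == "__main__":
    cost_gate(1200.0, 1250.0)   # +$50/month: within the limit
    cost_gate(1200.0, 1450.0)   # +$250/month: fails the gate
```

Because the gate runs on every change, engineers see the cost impact before merge, with no extra cognitive load.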
When discussing spending, you might think first of compute resources, but we often forget that everything you do in the cloud has a cost, from bandwidth to API calls to virtual entities like secrets. Then there are factors a weekly review cannot catch — how often have we seen messages online like "Because of X, I've received a huge cloud bill that's much larger than I expected"?
Sometimes your cloud provider will help you and offer a discount, but that's not always the case. Again, this is where automation can help. You can and should configure your cloud to send alerts when your bill exceeds a fixed limit, but even that should not be a one-off check.
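One way to go beyond a single fixed limit is tiered alerting: flag spend as it crosses several fractions of the monthly budget, so a runaway bill is caught early and repeatedly. The sketch below assumes you can fetch month-to-date spend from your provider's billing API (that fetch is left out); the thresholds are illustrative.

```python
# Sketch of tiered budget alerts: instead of one limit, emit a message
# for every budget threshold the month-to-date spend has crossed.
# Assumes month-to-date spend is pulled from the provider's billing API
# elsewhere; here it is passed in as a plain number.

def alerts_to_fire(month_to_date_usd: float,
                   budget_usd: float,
                   thresholds=(0.5, 0.8, 1.0)) -> list[str]:
    """Return an alert message for each threshold already crossed."""
    used = month_to_date_usd / budget_usd
    return [
        f"Spend at {int(t * 100)}% of budget (${month_to_date_usd:.2f})"
        for t in thresholds
        if used >= t
    ]


# $850 spent against a $1000 monthly budget crosses the 50% and 80% marks:
print(alerts_to_fire(850.0, 1000.0))
```

Run on a schedule (or on every billing-data refresh), this turns the one-off limit into a continuous signal.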
You might think FinOps is just about monitoring your production environment and applying corrective actions. But what if we told you that FinOps should start as close as possible to your development environment?
We'll give you one example. Imagine a developer changing a line of code and triggering double the cloud API calls for the same action. This might not seem like a big deal, but if we allow that change into production, where that API is triggered hundreds of thousands of times on a Friday to generate weekly reports, then… we suppose you get our point!
What you want, in general, is the shortest possible feedback loop between your cloud environment and your engineering team. Something as simple as an alert saying, "You just doubled the cost of this test case" might have saved you from this incident.
Apply weights to your test cases that reflect the cost of the same API calls in production, and you have an excellent pre-emptive system at your disposal.
In the same way that you monitor your system for critical metrics (e.g. latency) and expose your engineering team to them so they can spot changes right after deployments, you want your team continuously exposed to FinOps metrics. And you must enable them to respond appropriately when code or infrastructure changes trigger unexpected events.