FinOps practitioners and technical teams must begin working more closely together to factor financial reporting into their architectural designs, tagging and labelling taxonomies, and other configurations. When they do, there can be significant upside to a company. When they don’t, the consequences can be damaging.
Here’s a case study that illustrates the need perfectly:
Recently, a large financial services client of ours was struggling with serious challenges managing its cloud spend. Its first challenge was a total inability to charge back cloud costs to business units. Far more problematic was the fact that it had no means of tracing costs to either products or customers, which prevented accurate gross margin calculations for either. The company had approximately 25 cloud vendor billing accounts, and 80% of the spend was with two of them, yet it had little visibility into what that money was being spent on.
In addition, its cloud costs were growing disproportionately relative to revenues, and it had a strong feeling there was a considerable amount of waste driving the trend. But the lack of visibility into the spend prevented the company from tracking down and eliminating waste.
We engaged with the client to begin addressing the problem. The process began by identifying the right individuals from a variety of business functions to serve part time on the FinOps team. Because the company’s spend was $20 million to $25 million per year, it also reassigned one individual from its technical team to be 100% dedicated to FinOps. With the team established, the company began to analyze its data.
Working with the FinOps Team
Because FinOps practices depend on comprehensive and accurate segmentation of spend, we teamed up with their FinOps team to attack this challenge first. They documented the resource naming conventions and tag values that could be used to “map” resources to business units. This mapping formed a table in their database, which, using “if-then” logical queries, allowed them to enrich the billing data with business segmentation information and generate reports.
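To illustrate the idea, here is a minimal Python sketch of a mapping table applied with if-then rules. It is not the client's actual implementation: the field names (`resource_name`, `tags`, `cost`), the rule format, the example patterns and the business unit names are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapping table: (field, match type, pattern, business unit).
# A field starting with "tag:" refers to a tag key; anything else is a
# top-level field of the billing line item.
MAPPING_TABLE = [
    ("tag:cost-center", "equals", "cc-1001", "Payments"),
    ("resource_name", "prefix", "lend-", "Lending"),
]


def lookup_business_unit(line_item):
    """Return the business unit of the first matching rule, else 'UNMAPPED'."""
    for field, match_type, pattern, business_unit in MAPPING_TABLE:
        if field.startswith("tag:"):
            value = line_item.get("tags", {}).get(field[len("tag:"):], "")
        else:
            value = line_item.get(field, "")
        if match_type == "equals" and value == pattern:
            return business_unit
        if match_type == "prefix" and value.startswith(pattern):
            return business_unit
    return "UNMAPPED"


def cost_by_business_unit(line_items):
    """Sum billing cost per business unit after enrichment."""
    totals = defaultdict(float)
    for item in line_items:
        totals[lookup_business_unit(item)] += item["cost"]
    return dict(totals)


def mapped_share(line_items):
    """Fraction of total cost that the mapping table can attribute."""
    total = sum(item["cost"] for item in line_items)
    mapped = sum(
        item["cost"]
        for item in line_items
        if lookup_business_unit(item) != "UNMAPPED"
    )
    return mapped / total if total else 0.0
```

Tracking something like `mapped_share` over time is one way to see the percentage of traceable cost rise as rules are added, and to spot the kind of large unmapped line items that signal shared resources.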
As the work progressed, the team was able to trace a higher percentage of its costs to business units until it ran into a roadblock. Certain cloud services (specifically, compute hosts with multiple containers running on them) were billed as single items, but in reality these items served multiple customers and hosted different services. The billing data could not be used to identify what portion of those shared resources was attributable to which products and which services.
In the interim, the only solution was to develop a separate report using an API for the software (Kubernetes) running on the compute instances. The API enabled them to track utilization of specific containers on each compute host. Next, a performance metric had to be chosen to define the proportion of a host’s resources consumed by each customer and service. Options considered included CPU utilization, amount of storage used, amount of memory reserved, and amount of memory consumed.
Looking at the Process
The team chose the amount of memory reserved by each container as the metric. This made sense because whether or not a container actually consumed its full memory allocation, that allocation was unavailable to other containers and so could be considered “used.”
In other words, if containers for customer X reserved 10% of a host’s memory, then customer X was “charged” 10% of the cost of the host. The process was effective but problematic in several ways:
- It imposed a cumbersome reporting burden on technical team members that distracted from their mainline responsibilities.
- The report took time to execute, delaying production of a full cost report each month.
- The reports generated by the technical teams had to be combined with data provided by the FinOps cost management infrastructure to generate chargeback reports. This post-processing meant that users of the company’s cost management tooling had no real-time visibility into chargeback values. Theoretically, they could ask for a copy of the final report, but the reports were only run monthly for accounting and finance purposes.
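To make the arithmetic concrete, here is a minimal Python sketch of the memory-reservation allocation described above. The function name, the input shape and the handling of unreserved memory (reported under an `UNALLOCATED` bucket) are assumptions for illustration. The article does not say how the client treated capacity that no container reserved, and the reservation figures are assumed to have been collected already rather than fetched from any API here.

```python
from collections import defaultdict


def allocate_host_cost(host_cost, host_memory_gib, containers):
    """Split one host's cost by the share of memory each owner reserved.

    containers is a list of (owner, reserved_memory_gib) pairs; an owner
    can be a customer or a service. Memory that nobody reserved is
    reported as 'UNALLOCATED' (an assumption, not the client's policy).
    """
    if host_memory_gib <= 0:
        raise ValueError("host_memory_gib must be positive")

    allocation = defaultdict(float)
    reserved_total = 0.0
    for owner, reserved_gib in containers:
        allocation[owner] += host_cost * reserved_gib / host_memory_gib
        reserved_total += reserved_gib

    if reserved_total > host_memory_gib:
        raise ValueError("containers reserve more memory than the host has")

    unreserved_gib = host_memory_gib - reserved_total
    allocation["UNALLOCATED"] = host_cost * unreserved_gib / host_memory_gib
    return dict(allocation)
```

With these assumptions, a customer whose containers reserve 8 GiB on an 80 GiB host that costs 1,000 per month is charged 100, which is 10% of the host's cost.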
Although the API-based workaround generated the report the company needed, the process was not sustainable. The technical team informed the FinOps team that the only real “fix” was to restructure the workloads’ “namespaces” such that the costs could be directly traced within the billing data.
That project was extensive enough, however, that it had to be factored into the technical team’s overall sprint cycle and would defer other necessary engineering projects. The refactored namespaces were finally implemented two months after the API-based workaround had begun to be used, solving the problem.
The outcome was positive in the end. They were able to segment their spend and trace costs to customers, to services, and to the engineering teams that manage them. This enabled them to act on optimization initiatives much more effectively, shut down abandoned non-production workloads, and optimize their participation in vendor discount programs because they were able to gather usage forecasts from each responsible party.
The result? Although their revenues continued to grow at the same pace, their public cloud costs levelled off, with the astonishing result that within the first year they had saved a cumulative total of $9 million compared with the amount prior trend lines predicted they would have spent over the same period.
Despite the positive outcome, the company only regained control of its cloud spend after months or years of operating with such limited visibility that the cumulative waste they incurred totalled well into the seven figures.
Ultimately, that waste could have been avoided if the technical team had factored reporting requirements into its technical architectures. They had weighed the technical merits of each new architecture, and they certainly factored cost efficiency into their technical decisions as well. What they failed to do was to add a “third leg” to this stool: reporting requirements.
Best Practice for Architecting for FinOps
As the use of shared-resource technologies such as containerization and shared database hosts grows, the need for “architecting for FinOps” will grow as well. Solutions architects need to make a standard practice of architecting for FinOps by taking these steps:
Polling various consumers of financial reporting in the organization for their reporting needs for new workloads during the architecture phase. In the above case, doing so would have shown that the finance teams would need to measure cost by customer and cost by service, for example.
With this knowledge, they could have established a namespace schema for their containerized workloads that gave this visibility from the start. They would also have learned that demonstration and proof-of-concept instances would need to be tagged to customers so they could be terminated when they were no longer needed.
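As a rough illustration of what such a namespace schema could look like, the Python sketch below assumes a hypothetical convention of `<customer>--<service>--<environment>` and rolls costs up along any of those dimensions. The convention, the `--` separator and the example names are invented for this sketch; the article does not describe the schema the client actually adopted.

```python
from collections import defaultdict
from typing import NamedTuple


class NamespaceOwner(NamedTuple):
    customer: str
    service: str
    environment: str


def parse_namespace(namespace):
    """Parse a hypothetical '<customer>--<service>--<environment>' name."""
    parts = namespace.split("--")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"namespace does not follow the schema: {namespace!r}")
    return NamespaceOwner(*parts)


def rollup(costs_by_namespace, dimension):
    """Sum per-namespace costs by 'customer', 'service' or 'environment'."""
    if dimension not in NamespaceOwner._fields:
        raise ValueError(f"unknown dimension: {dimension!r}")
    totals = defaultdict(float)
    for namespace, cost in costs_by_namespace.items():
        owner = parse_namespace(namespace)
        totals[getattr(owner, dimension)] += cost
    return dict(totals)


def demo_namespaces(costs_by_namespace):
    """List namespaces whose environment is 'demo', as shutdown candidates."""
    return sorted(
        ns for ns in costs_by_namespace
        if parse_namespace(ns).environment == "demo"
    )
```

Because the customer, service and environment are encoded in the name itself, cost by customer and cost by service come from the same data with no separate report, and demonstration instances stay tied to the customer they were built for.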
To avoid the same mistake, solutions architects should first identify the various parties that need to consume reports on cloud spend data, such as senior management, finance, accounting, product owners and technical teams. They should hold working sessions with these individuals to inventory necessary reports. It is important to focus these discussions on reporting needs and make a distinction between reports that “may be nice to have” and those that are “vital.” Effort on the former should be subordinated to effort on the latter.
Identify the non-technical dimensions
Next, they should identify the non-technical dimensions of the architecture that will support reporting requirements. These dimensions may include the alignment of organizers such as AWS billing accounts, GCP Projects or Azure Resource Groups to the reporting needs. Tagging and labelling taxonomies that support the requirements should also be created when they are preferable to billing organizers.
Critically, tagging and labelling taxonomies should always be “demand driven,” meaning that only those tag values that support necessary reports should be established. The burden imposed by tagging requirements on technical teams should be minimized. When it comes to labelling, an age-old management axiom applies: measuring business activity carries a cost, so unless a metric will change a business decision, do not measure it.
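One way to keep a taxonomy demand driven is to derive it from the report inventory rather than maintain it separately. The Python sketch below assumes a hypothetical inventory in which each report lists its priority and the tag keys it depends on; the report names, tag keys and the "vital" label as a data value are all illustrative.

```python
# Hypothetical report inventory gathered in the working sessions.
REPORTS = [
    {"name": "chargeback by business unit", "priority": "vital",
     "tag_keys": {"business-unit"}},
    {"name": "gross margin by customer", "priority": "vital",
     "tag_keys": {"customer", "service"}},
    {"name": "spend by office floor", "priority": "nice-to-have",
     "tag_keys": {"floor"}},
]


def required_tag_keys(reports):
    """Return only the tag keys that at least one vital report depends on."""
    keys = set()
    for report in reports:
        if report["priority"] == "vital":
            keys |= report["tag_keys"]
    return keys


def unjustified_tag_keys(proposed_keys, reports):
    """Return proposed tag keys that no vital report needs."""
    return set(proposed_keys) - required_tag_keys(reports)
```

Any key returned by `unjustified_tag_keys` is a candidate to drop from the taxonomy, since no necessary report would change if it disappeared.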
Whenever possible, employ automation to ensure that there is integrity in the application of billing organizers (billing accounts, etc.) and tagging / labelling. The only thing worse than poor tagging coverage or a lack of billing organizer hierarchy is a lack of integrity in the billing data!
Once resources wind up in the wrong billing organizers or inaccurate tagging values spread throughout the data, it can be extremely difficult to remediate the situation.
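As a sketch of what such automation might check, the Python function below validates a resource's tags against a taxonomy of required keys and allowed values. The taxonomy contents are hypothetical, and where the check runs (for example, in a deployment pipeline or a scheduled audit) is left open; this is an illustration of the idea rather than a specific vendor feature.

```python
# Hypothetical taxonomy: each required tag key maps to its allowed values.
TAXONOMY = {
    "business-unit": {"payments", "lending"},
    "environment": {"prod", "demo", "dev"},
}


def tag_violations(resource_tags, taxonomy=TAXONOMY):
    """Return a list of problems with one resource's tags (empty if clean)."""
    violations = []
    for key, allowed_values in sorted(taxonomy.items()):
        if key not in resource_tags:
            violations.append(f"missing required tag '{key}'")
        elif resource_tags[key] not in allowed_values:
            violations.append(
                f"invalid value '{resource_tags[key]}' for tag '{key}'"
            )
    return violations
```

Rejecting or flagging a resource as soon as `tag_violations` returns anything keeps inaccurate values, such as a miscapitalized business unit, from spreading through the billing data in the first place.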
Identify key scenarios
Identify any scenarios where the reporting requirements call for a change in the application of the cloud technologies themselves. These are most often encountered when shared services such as containerized workloads or shared database hosts are in use.
Revise the reference architecture as necessary to be sure the reporting requirements can be met.
About the author:
Rich Hoyer is the Director of Customer FinOps at SADA.