Top 3 Lessons from Running 300+ Cost-effective Solutions on AWS

Public cloud spend is expected to reach $400 billion in 2022, of which 30-40% is wasted. That is $120 billion to $160 billion. For comparison, the global frozen pizza market is expected to be worth $17.96 billion in 2022.

Note: this post focuses on AWS since I am an AWS Community Builder in Machine Learning and work exclusively with AWS. The same message applies to all other clouds.

Lesson #1 – Solve the Right Problem

This amount of obvious waste is a golden market opportunity, both for vendors offering instant fixes for their clients’ cloud environments, with varying success, and for foundations providing their believers with ready-to-apply FinOps frameworks whose introduction may do more harm than good.

In fact, the key problem with these approaches is that they solve the wrong problem. The right problem to solve is how to design the cloud architecture in a way that makes the most of the cloud’s unique features. The waste is a consequence of poor cloud architecture and/or development processes, and fixing the symptoms rather than the real cause only exacerbates the actual problem.

There is no shortcut and no AWS Made Easy: you cannot design an optimal cloud architecture without thorough and constantly updated knowledge of the cloud. Moreover, the public cloud is an amplifier. Because of its virtually unlimited elasticity and scalability, it amplifies the consequences of both good and poor decisions.

An example we saw in several solutions was the misuse of an AWS Aurora (RDS) database for storing and processing queue records. This drove the IOPS required on the database to gigantic levels, which the Cost Optimization team (FinOps) addressed by upgrading the Aurora instance size and purchasing a reserved instance for it. Naturally, the right solution is to move the queue to AWS SQS, at a fraction of the maintenance effort and AWS cost.
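As a rough illustration of the queue move described above, the sketch below sends and consumes records through SQS instead of polling a database table. It assumes a boto3 SQS client is passed in (for example `boto3.client("sqs")`); the `enqueue` and `drain` helper names, the queue URL, and the handler are hypothetical and not taken from any of the solutions mentioned.

```python
import json


def enqueue(sqs, queue_url, record):
    """Send one record to the queue as a JSON message body."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(record))


def drain(sqs, queue_url, handler, max_batches=1):
    """Receive up to `max_batches` batches, handle each message, then delete it.

    Returns the number of messages handled.
    """
    handled = 0
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=20,      # long polling reduces empty receives
        )
        for message in response.get("Messages", []):
            handler(json.loads(message["Body"]))
            # Delete only after the handler returns, so failed messages
            # become visible again and are redelivered.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            handled += 1
    return handled
```

The point of the sketch is that the queue semantics (delivery, retries, deletion) come from the managed service, so the database no longer has to absorb the polling and row-locking load.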

We believe in a comprehensive approach to using AWS, compatible with the AWS Well-Architected Framework, and do not believe Cost Optimization or FinOps should drive AWS usage. FinOps is reactive; be proactive.

Recommendation: focus on upgrading your cloud architectural knowledge rather than implementing any non-actionable FinOps Framework. I have found no better alternative than AWS Certification.

Lesson #2 – The Best Technical Decisions are not Good Enough

Some firms pride themselves on making all their architectural decisions on purely technical grounds, ignoring any cost considerations. This is a fantasy, and it is how their cost optimization teams later end up achieving savings of 60% or more. In addition, such “best technical decisions” are usually not the best, since making them this way forces you to ignore critical aspects of the problem.

An analogy: if you want to travel from Paris to London, you should always travel by private jet, since it saves you the most time, in-person meetings are always preferable, and you enjoy traveling in comfort.

Let’s consider an example: how do you store a phone number in AWS so that it can be publicly accessed?

There is an infinite number of solutions with vastly different performance, security, maintenance, and immediate and long-term costs:
• Use AWS Route 53 (TXT record)
• Use AWS S3
• Use AWS EFS + AWS Lambda + AWS API Gateway
• Use a camera connected to an AWS Panorama device looking at your business card + AWS Direct Connect + AWS Textract + AWS API Gateway
• Use some satellites…

This simple example illustrates that there is no such thing as the best technical decision taken in isolation from costs, security, maintainability and … → all architectural decisions are made in a business context. Thus, the choice of the right resource type for the job, or any other architectural decision, depends on these two questions:

  • Question 1 – Technical Decision → The list of technically best solutions
    • Consider how well the performance and security profile of the resource type matches the expected long-term patterns and scale of use (note that long-term here means the next few months, since the optimal architecture changes every few months)
  • Question 2 – Economic Decision → The actual choice
    • Given the answers to Question 1, factor in the total long-term cost of ownership
      • Adoption, Setup, Maintenance, Migration Costs
      • Reduced / increased costs from simplifying / complicating the architecture
      • Implied costs of technology lock-in

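The two questions above can be sketched as a filter followed by a cost comparison. This is only a minimal illustration: the `Option` fields, the helper names, and every number in the usage example are hypothetical placeholders, not real AWS prices or measurements.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    meets_performance: bool  # Question 1: fits expected patterns and scale of use
    meets_security: bool     # Question 1: fits the required security profile
    setup_cost: float        # adoption + setup, one-off
    monthly_cost: float      # cloud bill + maintenance effort, per month
    exit_cost: float         # migration / lock-in cost if you move away


def total_cost(option, horizon_months):
    """Question 2: total cost of ownership over the planning horizon."""
    return option.setup_cost + option.monthly_cost * horizon_months + option.exit_cost


def choose(options, horizon_months):
    """Shortlist the technically acceptable options, then pick the cheapest one."""
    shortlist = [o for o in options if o.meets_performance and o.meets_security]
    if not shortlist:
        return None
    return min(shortlist, key=lambda o: total_cost(o, horizon_months))


if __name__ == "__main__":
    # Placeholder figures for the phone-number example; not real prices.
    options = [
        Option("Route 53 TXT record", True, True, 10.0, 1.0, 5.0),
        Option("S3 object", True, True, 5.0, 0.5, 2.0),
        Option("EFS + Lambda + API Gateway", True, True, 200.0, 30.0, 100.0),
        Option("Panorama camera + Textract", False, True, 5000.0, 300.0, 1000.0),
    ]
    print(choose(options, horizon_months=6).name)
```

Keeping the technical filter separate from the cost ranking mirrors the order of the two questions: cost never rescues an option that fails Question 1, and technical elegance never overrides the total cost of ownership in Question 2.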
Recommendation: all architectural decisions should be based upon the cost/performance criterion.

Lesson #3 – Adopt the Cloud-Native Mindset

To make the most of the cloud’s opportunities, you need to adopt the cloud-native mindset, which rests on the following principles:

  1. The cloud is elastic, meaning that cloud costs should closely follow revenue.
  2. Always choose the right resource type for the job – for example, use AWS Graviton instances for most use cases.
  3. Cloud is about rapid experimentation and innovation.
  4. Expert human work is more expensive than any cloud cost.
  5. You should use managed services for most parts of your architecture, even if they deliver just 90% of the performance of self-managed alternatives, since they offload manual management and save time.
  6. Most of the code in modern applications interconnects existing APIs, and you should aim to minimize your own code.
  7. Scalability means that the resources are disposable, so you should store all data decoupled from the processing resources. Data should be stored in standard formats so that it can be accessed with other solutions.
  8. Design your solutions with security and fault tolerance in mind.
  9. Observability, automation, and constant benchmarking are critical for achieving an optimal, evolving architecture. Real-time monitoring, metrics, and logging deliver massive business value. Incorporate custom metrics into all applications to help answer questions such as the cost per customer or the cost per feature (“unit costs”). Benchmark against older versions of each solution and against similar solutions to identify inefficiencies.
  10. All business decisions are driven by forecasts. Monitoring and logging free you from making blind estimates.
  11. You should review the architecture and re-architect the solutions every 3 months to make the most of the new services and features.
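
Points 1 and 9 in the list above can be made concrete with a few lines of code. The sketch below assumes you already export cost records tagged with a customer id, plus monthly cost and revenue totals; the record layout and both function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def cost_per_customer(cost_records):
    """Sum cost records tagged with a customer id into per-customer unit costs."""
    totals = defaultdict(float)
    for record in cost_records:
        totals[record["customer"]] += record["cost"]
    return dict(totals)


def cost_to_revenue_ratio(monthly_cost, monthly_revenue):
    """Return cost / revenue for each month that has non-zero revenue.

    A roughly flat series suggests that cloud costs follow revenue;
    a rising one is a signal to review the architecture.
    """
    return {
        month: monthly_cost[month] / monthly_revenue[month]
        for month in monthly_cost
        if monthly_revenue.get(month)
    }
```

Tracking the ratio over time, rather than the absolute bill, is what separates healthy growth in spend from waste.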