Best Practices for Maximizing Spot Savings and Stability

Spot can save you up to 90% compared to On-Demand prices, but it’s often perceived as unreliable.

That’s why we partnered with the AWS Spot team to give you an in-depth overview of how and where to use Spot. We’ll also explain how the nOps platform natively works with Spot to make it easy to confidently adopt Spot in your organization.

Stay tuned for:

Spot overview and how it works
Spot Best Practices and the ideal workloads for Spot
Getting started with Spot and balancing compute purchase types
How to run workloads on Spot with confidence
Real-world example of before and after Spot optimization

Download the full webinar here:

Maximizing Spot Savings and Stability: Webinar

What is AWS Spot and how does it work?

AWS Spot Instances let you use spare EC2 capacity at discounts of up to 90%. But what exactly is an EC2 instance? It is a virtual server in the AWS cloud, and AWS provides over 750 EC2 instance types to suit the compute needs of virtually every workload: general-purpose, storage-optimized, memory-optimized, or those with specific processing capabilities such as Arm-based instances with Graviton processors.

Typically, users can purchase EC2 instances in several ways:

On-Demand: This is the traditional and most flexible option, where users pay by the second without long-term commitments. It can be used even for unpredictable workloads that may spike unexpectedly. However, it is the most expensive option.
Savings Plans: Over time, you learn what compute footprint you will need for an extended period. That’s where Savings Plans come in. They suit long-running, stable, predictable workloads and offer significant discounts in exchange for 1- or 3-year commitments.
Spot Instances: These are available at a significant discount as they utilize spare AWS capacity. However, AWS can reclaim these instances with just two minutes’ notice. They are best used with fault-tolerant, loosely coupled, stateless workloads architected to handle these interruptions.

A few Spot misconceptions

Firstly, let’s clear up a few of the most common myths we hear about Spot.

FALSE: Spot Instances involve bidding on prices. Bidding was removed in 2017. Today, Spot pricing is determined by current and long-term supply and demand, leading to relatively stable costs that are easier to predict. You no longer have to deal with the spikes and anomalies that sometimes existed when bidding was still in place.

FALSE: Spot Instances are a secondary, old, or bad instance type. In truth, Spot offers the exact same quality as On-Demand. They are a powerful cost optimization tool, especially when integrated with management solutions like nOps to enhance their reliability and usability within your organizational infrastructure.

FALSE: Spot is just for extreme cost optimization. Spot suits many different types of workloads, and cost is not always the only reason to use it.

FALSE: Spot can’t be used in production. While Spot interruptions can occur, with appropriate workload qualification, planning, and management, many AWS customers successfully run Spot in production without any end-user impact.

FALSE: Spot is difficult. We’ll discuss some best practices and solutions like nOps that help ensure that Spot is not difficult to implement.

Spot Best Practices & Tips

Let’s start by understanding some best practices that are absolutely key to Spot success.

Why Spot diversification matters

Understanding Spot starts with comprehending what we mean by “spare capacity.” In each AWS region and Availability Zone (AZ), AWS operates various specific instance sizes to maintain cloud elasticity, ensuring it can accommodate incoming requests for EC2. This inevitably results in some level of spare capacity—essentially unused resources.

The availability of Spot capacity can fluctuate, influenced by seasonal spikes or increased demand from particular organizations for particular instance types, affecting how much spare capacity is available and the potential cost savings.

To use Spot effectively, open your workloads up to as many of these capacity pools as you can (a capacity pool is one instance type in one AZ). By being flexible about instance types and sizes, and by distributing your workloads across different AZs and regions, you keep a fallback available if one pool is interrupted.
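As a rough sketch of what diversification can look like in configuration, the Python below builds the MixedInstancesPolicy dictionary that an EC2 Auto Scaling group accepts and counts how many capacity pools the chosen instance types and AZs open up. The launch template name, the instance types, the AZs, and the 20% On-Demand share are all illustrative assumptions, not recommendations from the webinar.

```python
from typing import Dict, List


def build_mixed_instances_policy(
    launch_template_name: str,
    instance_types: List[str],
    on_demand_percentage: int = 20,  # assumed split, tune for your workload
) -> Dict:
    """Build a MixedInstancesPolicy dict for an EC2 Auto Scaling group.

    Every instance type listed under Overrides is one more Spot capacity
    pool per Availability Zone that the group may draw from.
    """
    if not instance_types:
        raise ValueError("at least one instance type is required")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": on_demand_percentage,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


def capacity_pool_count(instance_types: List[str], availability_zones: List[str]) -> int:
    """A Spot capacity pool is one instance type in one Availability Zone."""
    return len(set(instance_types)) * len(set(availability_zones))


# Hypothetical inputs, for illustration only.
types = ["c5.xlarge", "c5a.xlarge", "c5d.xlarge", "m5.xlarge"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

policy = build_mixed_instances_policy("my-hypothetical-template", types)
pools = capacity_pool_count(types, zones)
print(f"{pools} capacity pools available to this group")
```

The resulting dictionary could be passed as the MixedInstancesPolicy argument when creating the group through the AWS SDK. Going from one instance type in one AZ to four types in three AZs raises the number of pools from 1 to 12.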

Interruptions only occur when AWS must reclaim instances for On-Demand or Savings Plans customers. You’ll receive a two-minute warning, typically sufficient to manage the transition or mitigate the impact on your operations by draining tasks or redistributing them to other capacity pools.
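To make the two-minute warning concrete, here is a minimal sketch of how a process running on a Spot Instance might check for the notice through the instance metadata service and work out how much time is left. The polling approach and the function names are assumptions for illustration. In practice, managed integrations such as those described in the next section usually handle this for you.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw interruption notice, or None if none is pending.

    Only works when run on an EC2 instance. Uses IMDSv2, so a session
    token is requested first.
    """
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption is scheduled
            return None
        raise


def seconds_until_interruption(notice: str, now: datetime) -> float:
    """Parse a notice such as {"action": "terminate", "time": "...Z"}."""
    payload = json.loads(notice)
    when = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()


# Offline example with a hand-written sample notice.
sample = '{"action": "terminate", "time": "2024-05-01T12:02:00Z"}'
now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, now)
print(f"{remaining:.0f} seconds left to drain tasks")
```

A real handler would call fetch_instance_action every few seconds and, once a notice appears, stop accepting new work and drain what is in flight.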

Let’s talk about some of the automations and integrations that can help you do this.

Spot integrations for ease of management

Spot has become significantly easier to adopt thanks to robust integrations with AWS native services, including Amazon EKS, ECS, Auto Scaling, EC2 Fleet, and EMR. These integrations are designed to simplify the diversification strategies discussed above and to help mitigate the impact of Spot interruptions.

Beyond AWS native services, Spot has also found extensive application within various open-source tools. For instance, whether you’re using a managed service like Amazon EKS or self-managing your Kubernetes cluster, Spot instances can seamlessly integrate into your environment.

CI/CD is another good fit. Jenkins pipelines, for example, are an excellent place to start integrating Spot: when you initiate a build, you need compute resources, and once it completes, you want to shut them down again. Fortunately, Jenkins has a native integration with Spot, making it a seamless process.

Additionally, Spot-ready partners play a crucial role in facilitating the broader adoption of Spot Instances. These partners, including nOps, have collaborated closely with AWS to incorporate all these best practices of diversification, workload qualification, and interruption-handling directly into their platforms. These partnerships enable customers to leverage EC2 Spot confidently, optimizing costs without compromising on performance or availability.

What is nOps Compute Copilot?

nOps Compute Copilot is an intelligent workload provisioner. It continuously manages, scales, and optimizes all of your AWS compute to get you the lowest cost with maximum stability.

In line with AWS recommendations, Compute Copilot was not built on proprietary auto-scaling technology but integrates with your preferred and existing AWS native services.

Native integration is preferred because it means there is very low overhead to migrate your existing workloads to Compute Copilot. Copilot will automatically and sensibly apply commitments like Savings Plans and Reserved Instances. It will continuously tune and optimize your workloads for you, putting your configurations for ASGs, EKS auto-scaling, ECS auto-scaling, and Batch on autopilot without constant effort from your engineering teams.

A sidenote on the benefits of Karpenter

Many organizations running workloads on EKS or self-managed Kubernetes on AWS are hearing more and more about Karpenter, which recently went GA. Karpenter is the most advanced EKS node provisioning framework available today.

One of the biggest problems when you move to a container orchestration platform like Kubernetes is making sure that you can efficiently provision your containers and pack them onto your AWS compute instances. Unlike traditional methods that rely on standard-sized instances, Karpenter chooses from a variety of instance types and sizes to select the node (EC2 instance) that best fits the specific needs of the workload.
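The idea of sizing a node to the workload can be illustrated with a toy example. The sketch below is not Karpenter's actual algorithm: it simply sums the resource requests of pending pods and picks the cheapest node type that fits them. The vCPU and memory figures match the real instance types, but the hourly prices are made up, and the code ignores details such as system-reserved capacity on each node.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pod:
    cpu_millicores: int
    memory_mib: int


@dataclass(frozen=True)
class NodeType:
    name: str
    cpu_millicores: int
    memory_mib: int
    hourly_price: float  # hypothetical price, for illustration only


def cheapest_fitting_node(pods: List[Pod], catalog: List[NodeType]) -> Optional[NodeType]:
    """Return the cheapest node type whose capacity covers the summed pod requests."""
    need_cpu = sum(p.cpu_millicores for p in pods)
    need_mem = sum(p.memory_mib for p in pods)
    fitting = [
        n for n in catalog
        if n.cpu_millicores >= need_cpu and n.memory_mib >= need_mem
    ]
    # default=None covers the case where nothing in the catalog is big enough
    return min(fitting, key=lambda n: n.hourly_price, default=None)


catalog = [
    NodeType("c5.large", 2000, 4096, 0.04),
    NodeType("c5.xlarge", 4000, 8192, 0.08),
    NodeType("m5.xlarge", 4000, 16384, 0.09),
    NodeType("r5.xlarge", 4000, 32768, 0.11),
]
pending = [Pod(500, 2048), Pod(1500, 3072), Pod(1000, 4096)]

choice = cheapest_fitting_node(pending, catalog)
print(choice.name if choice else "no single node type fits")
```

Here the pods need 3 vCPUs and 9 GiB in total, so the compute-optimized c5.xlarge runs out of memory and the general-purpose m5.xlarge is the cheapest type that fits. A team that had standardized on C5 instances alone would have needed two nodes or a larger size.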

Additionally, Karpenter enhances container consolidation by actively seeking opportunities to shut down nodes that are no longer optimally sized or utilized. It also features native support for AWS Spot tools, seamlessly integrating with the AWS Spot ecosystem.

At nOps, we’ve purpose-built Compute Copilot to facilitate easy integration with Karpenter. For organizations currently using Cluster Autoscaler, we’ve successfully assisted numerous customers in transitioning to Karpenter. Compute Copilot enhances organizational awareness within EKS clusters, tuning them for optimal performance based on commitment inventory insights and trends within the Spot market. Additionally, it automates Karpenter configuration, reducing the need for your engineering teams to continuously audit your Karpenter settings.

How to run on Spot with confidence with Compute Copilot

Whether you prefer managing your infrastructure through code or opting for complete automation via our platform, nOps supports both approaches and integrates seamlessly with tools like Terraform and Git to continuously update your cluster’s configurations.

Let’s talk about some of the ways nOps makes running on Spot easy.

EKS observability and management: Compute Copilot provides an additional layer of insight and control over your EKS clusters. It gives you a high level of visibility into cluster operations, simplifying the configuration and ongoing management processes.

You can see how your instances are diversified across different instance families, types, and sizes. You can also view your Spot termination rate, cluster efficiency, actual cost and usage, and other key metrics to measure success.

Intelligent Instance Selection: Compute Copilot makes it easy to select instances based on workload requirements, with an intuitive interface that simplifies translating those requirements into specific architectural needs.

Often, engineering teams might select a specific instance family, like C5, deemed suitable for their workload. However, other variants within the same family, such as C5a and C5d, along with other families like the general-purpose M and memory-optimized R instances, might also be architecturally appropriate and could enhance performance. Compute Copilot not only identifies these options but also assists in configuring the essential parameters: CPU, bandwidth, and GPU considerations. This relieves your engineering team from the need to become experts in the vast array of instance options available from AWS and ensures access to a wide range of eligible instance types, sizes, and families to fulfill the Spot best practice of diversity.
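To illustrate the difference between naming a family and stating requirements, the short sketch below filters a small catalog by minimum vCPUs and memory. The vCPU and memory values are the real sizes of those instance types, but the selection function is hypothetical and is not how Compute Copilot works internally.

```python
from typing import Dict, List, Tuple

# (vCPUs, memory in GiB) for a handful of instance types
SPECS: Dict[str, Tuple[int, int]] = {
    "c5.large": (2, 4),
    "c5.xlarge": (4, 8),
    "c5a.xlarge": (4, 8),
    "c5d.xlarge": (4, 8),
    "m5.xlarge": (4, 16),
    "r5.xlarge": (4, 32),
}


def eligible_types(min_vcpus: int, min_memory_gib: int) -> List[str]:
    """Return every instance type in SPECS that meets the stated minimums."""
    return sorted(
        name
        for name, (vcpus, memory_gib) in SPECS.items()
        if vcpus >= min_vcpus and memory_gib >= min_memory_gib
    )


# A team that would have asked for "c5.xlarge" instead states what it needs.
eligible = eligible_types(min_vcpus=4, min_memory_gib=8)
print(eligible)
```

Stating the requirement as "at least 4 vCPUs and 8 GiB" turns one candidate into five, each of which is a separate Spot capacity pool in every AZ.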

Continuous optimization of Spot and Commitments: Compute Copilot dynamically adjusts to changes in your environment, ensuring you’re always on the perfect blend of Spot, Savings Plans, and Reserved Instances. It fully automates the commitment management process, eliminating the need for extensive planning exercises to determine purchases. nOps guarantees 100% utilization of those commitments, ensuring you never have to worry about overcommitting.
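A simple way to reason about this blend is to separate the steady baseline from the part of demand that comes and goes. The sketch below uses made-up hourly demand figures and a deliberately naive rule (commit only to the minimum observed demand) to show the arithmetic. It is an assumption for illustration, not the method nOps uses.

```python
from typing import Dict, List


def split_coverage(hourly_demand: List[int]) -> Dict[str, float]:
    """Split demand (instances per hour) into a committed baseline and a variable remainder."""
    total = sum(hourly_demand)
    if not hourly_demand or total == 0:
        raise ValueError("hourly_demand must contain some non-zero demand")
    baseline = min(hourly_demand)  # always-on floor, a candidate for commitments
    committed = baseline * len(hourly_demand)
    return {
        "baseline_instances": baseline,
        "committed_share": committed / total,
        "variable_share": (total - committed) / total,  # candidate for Spot
    }


# Hypothetical demand: instances running in each of eight sampled hours.
demand = [10, 10, 12, 20, 30, 30, 20, 12]
result = split_coverage(demand)
print(f"baseline: {result['baseline_instances']} instances")
print(f"committed share: {result['committed_share']:.1%}")
print(f"variable share: {result['variable_share']:.1%}")
```

With these numbers, total demand is 144 instance-hours, of which 80 sit under the 10-instance baseline. Commitments sized to the baseline stay fully utilized, and the remaining 64 instance-hours (about 44%) are the portion where Spot can do the work.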

Real-Time Workload Reconsideration & Graceful Pod Rebalancing: Compute Copilot introduces best practices for migrating workloads to new nodes, ensuring they gracefully handle all of the many changes taking place in an auto-scaling environment. Additionally, Compute Copilot intelligently manages the distribution of workloads across availability zones, continuously directing more compute resources to zones where performance is optimal and pricing is favorable. In line with Spot best practices, it is very proactive about shifting workloads to fresh instances to automatically maximize the health and stability of your workloads.

Real-world example of before and after Spot Optimization

In this case study, we’ll discuss a real nOps customer and how they migrated to Spot.

To set the stage, this particular organization had fairly large AWS usage (specifically EKS), running many varied workload types: machine learning, APIs, and containers, plus a large SaaS product doing a lot of batch data processing. In short, they had a large and diverse engineering team and a highly dynamic environment, and they were looking for the best performance and costs.

The problem:

Savings Plans could only cover a small portion of workloads due to lots of autoscaling
Spot and Commitments were difficult to balance together
It was difficult to identify workloads that ‘just work’ with Spot
They needed a more advanced node-provisioning framework (Karpenter)

The solution:

Step one was that nOps made the migration to Karpenter super easy. The organization adopted Karpenter across all of their environments, from lower dev environments all the way to production in just a couple of weeks.

We collaborated with them to complete a proof of value in some of their lower environments across various workload types. This process gave them the visibility and confidence needed to start deploying Compute Copilot and integrating a balance of Spot and Commitment Management into their higher environments.

In March, their compute price optimization through Spot was around 12.5%. By June, this figure had increased to 50.5%, with a significant portion of their production, mission-critical workloads contributing to this achievement. This translated to effective monthly savings of $66,991.

These stellar results illustrate how easy Compute Copilot makes it to confidently introduce Spot as one of the tools you can use to achieve maximum price optimization.

To join our customers using nOps to leverage Spot with complete confidence, book a demo today!
