AWS Spot Instances are spare AWS capacity that users can purchase at a steep discount from On-Demand pricing (up to 90%). The catch is that AWS does not guarantee you’ll be able to keep a Spot Instance until your work is done. When AWS needs the capacity back, for example to serve a user willing to pay the full On-Demand price, it can terminate these instances with a two-minute warning (commonly known as a Spot Instance interruption).
These unexpected interruptions can cause workloads to fail. That’s why many companies are afraid to use Spot despite the significant potential cost savings.
At nOps, not only do we run our own production on Spot, but we also manage over $1.5 billion in AWS spend for our customers. This vast experience has provided us with extensive data on Spot usage, enabling us to develop a robust methodology for reliably leveraging Spot to save substantially on your total compute cost.
In this article, we’ll share five crucial Spot facts that inform our best practices for leveraging Spot Instances effectively.
1. Lowering your overall termination rate is the key to success
Spot instances may last from a couple of minutes to a couple of days or even months, but eventually they will disappear.
In fact, no EC2 instance (including On-Demand) lives forever; every one of them eventually dies. That means that the key to Spot success is lowering your overall termination rate to an acceptable level.
High-availability architecture is essential and has been around for decades. Most compute in the cloud is dynamic and runs behind an Elastic Load Balancer or under container orchestration such as Kubernetes, allowing it to scale and handle interruption.
But the other key piece for achieving reliable Spot savings is lowering your overall termination rate. Instance diversity, investment in better performing availability zones, and using architecture-equivalent but less popular resources can all dramatically reduce termination risk and contribute to maximum success.
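On the interruption-handling side, an instance can watch for the two-minute warning itself. The sketch below polls the EC2 instance metadata service (IMDSv2) for the `spot/instance-action` item, which returns a 404 until an interruption is scheduled. The function names are hypothetical, and what you do once a notice arrives (draining, checkpointing) depends on your workload, so treat this as a minimal starting point rather than a complete handler.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional, Tuple

# Link-local address of the EC2 instance metadata service.
IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str) -> Tuple[str, datetime]:
    """Parse the JSON body of the spot/instance-action metadata item."""
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def fetch_instance_action(timeout: float = 1.0) -> Optional[Tuple[str, datetime]]:
    """Return (action, time) if an interruption is scheduled, else None.

    Hypothetical helper: it only works when run on an EC2 instance.
    """
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise
```

Polling every few seconds leaves most of the two-minute window for draining connections or checkpointing work.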
2. The Spot reaper comes in batches
The death of a Spot instance is not as simple as it might seem. Typically, Spot instances are terminated in groups. For example, if I have an m5.large in AZ1 and it dies, typically a whole group of m5.larges in AZ1 will die too.
Batch terminations of Spot Instances are commonly observed anecdotally and confirmed by our data.
We pulled some real and very recent data on Spot instance interruptions to illustrate. This table shows termination events that occurred in a thirty-minute window for specific instance types in a specific Region and Availability Zone.
Period | Region | Instance Type | Terminations |
2024-05-08 – 17:30 to 18:00 | us-east-1 | r5.xlarge | 1445 |
2024-05-04 – 19:30 to 20:00 | us-east-1 | r6i.xlarge | 696 |
2024-05-02 – 23:30 to 00:00 | us-east-1 | c5.9xlarge | 349 |
2024-05-01 – 04:30 to 05:00 | us-west-2 | i4i.2xlarge | 136 |
Observe how many termination events occur in a short period: when a batch termination hits, a large number of instances of that particular type disappear at once. Many of those who were betting on r5.xlarge at 5:30 p.m. on May 8, 2024 lost that bet at that moment.
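If you keep your own log of interruptions, a grouping like the one in the table is easy to reproduce. The sketch below is one way to do it, assuming a hypothetical list of `(timestamp, region, instance_type)` records; it is not the pipeline behind the numbers above.

```python
from collections import Counter
from datetime import datetime


def window_start(ts: datetime, minutes: int = 30) -> datetime:
    """Round a timestamp down to the start of its window (minutes must divide 60)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def batch_windows(events, threshold: int, minutes: int = 30):
    """Count terminations per (window, region, instance type); keep the large batches.

    `events` is a hypothetical iterable of (timestamp, region, instance_type) tuples.
    """
    counts = Counter(
        (window_start(ts, minutes), region, itype) for ts, region, itype in events
    )
    batches = [(key, n) for key, n in counts.items() if n >= threshold]
    # Largest batches first, ties broken by window start.
    return sorted(batches, key=lambda kv: (-kv[1], kv[0]))


# Tiny made-up sample, for illustration only.
events = [
    (datetime(2024, 5, 8, 17, 31), "us-east-1", "r5.xlarge"),
    (datetime(2024, 5, 8, 17, 40), "us-east-1", "r5.xlarge"),
    (datetime(2024, 5, 8, 17, 59), "us-east-1", "r5.xlarge"),
    (datetime(2024, 5, 8, 18, 5), "us-east-1", "c5.9xlarge"),
]
for (start, region, itype), n in batch_windows(events, threshold=2):
    print(start, region, itype, n)
```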
Why does this occur? We can only speculate. AWS runs an enormous number of instances at once, in many different configurations, so its operational strategies are understandably complex. The observed pattern likely reflects an operational efficiency tactic, in which instances of similar types are grouped and managed collectively to streamline resource allocation and keep the system stable. We cannot state this as a rule, but the pattern is consistent with the operational behavior often seen in large-scale cloud environments.
3. Refresh Spot instances so they are at different stages in their lifecycle
Another risk that can impact your workload is having all of your Spot instances at the same point in their lifecycle. If so, multiple instances are likely to be terminated at the same time — causing downtime.
However, staggering lifecycles is easier said than done: how do you know how long each Spot instance type is likely to live?
The lifecycle of any given Spot instance is difficult, if not impossible, to predict on your own. Because nOps manages $1.5 billion in cloud spending, we can analyze massive amounts of Spot data over very long periods to detect long-term population patterns and seasonal behaviors by instance family, Region, instance type, Availability Zone, and other factors. That gives us a very good idea of how long a Spot instance is likely to live.
By gracefully and continually refreshing Spot instances before they are likely to be interrupted, you can greatly increase your reliability.
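To make the staggering idea concrete, here is a minimal sketch. It assumes you already have some estimate of a safe maximum age for your instances (`max_age_min`, a hypothetical input you would have to supply) and spreads the first refresh of each instance evenly across that period. After the first round, refreshing each instance every `max_age_min` minutes keeps the fleet evenly spread across lifecycle stages.

```python
def first_refresh_offsets(instance_ids, max_age_min: float):
    """Map each instance to the minutes from now at which it is first refreshed.

    Offsets are evenly spaced, so no two instances reach max_age_min together.
    Both the instance IDs and max_age_min are illustrative inputs.
    """
    n = len(instance_ids)
    return {iid: max_age_min * (i + 1) / n for i, iid in enumerate(instance_ids)}


schedule = first_refresh_offsets(["i-a", "i-b", "i-c", "i-d"], max_age_min=120)
print(schedule)  # {'i-a': 30.0, 'i-b': 60.0, 'i-c': 90.0, 'i-d': 120.0}
```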
4. Treat your Spot strategy like an investment strategy
When you’re investing in the stock market, a best practice is to diversify your investments. Don’t put all your money into one stock, or one industry; that’s extremely risky. On the other hand, if you buy a broad range of stocks, no single failure can sink your portfolio.
The takeaway is that you need to diversify your Spot Instances across various instance types, availability zones, and lifecycle stages to reduce the impact of market fluctuations on your workload. Don’t keep all of your eggs in one Spot basket.
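A quick way to see why diversification matters is to count how much of your fleet sits in any one pool. The sketch below spreads a desired capacity round-robin across every combination of instance type and Availability Zone, then reports the fraction you would lose if the fullest pool were reclaimed in one batch. The helper names and the round-robin policy are assumptions for illustration, not a description of how any particular tool allocates capacity.

```python
from itertools import cycle, product


def spread_capacity(desired: int, instance_types, zones):
    """Distribute `desired` instances round-robin over (instance type, zone) pools."""
    pools = list(product(instance_types, zones))
    alloc = {pool: 0 for pool in pools}
    for _, pool in zip(range(desired), cycle(pools)):
        alloc[pool] += 1
    return alloc


def worst_case_loss(alloc) -> float:
    """Fraction of the fleet lost if the single largest pool is terminated."""
    total = sum(alloc.values())
    return max(alloc.values()) / total if total else 0.0


one_pool = spread_capacity(10, ["m5.large"], ["us-east-1a"])
diverse = spread_capacity(
    10, ["m5.large", "m5a.large", "m6i.large"], ["us-east-1a", "us-east-1b"]
)
print(worst_case_loss(one_pool))  # 1.0
print(worst_case_loss(diverse))   # 0.2
```

With one pool, a single batch termination takes out the whole fleet; with six pools, the same event costs at most a fifth of it.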
5. Architect for replacement: staying ahead of Spot terminations
Mr. Meeseeks is a character from the TV show “Rick and Morty.” He is summoned by pressing a button on the Meeseeks Box and exists to fulfill a specific task. Once his task is completed, he disappears. Mr. Meeseeks is cheerful and helpful, but becomes increasingly unstable and desperate if the task takes too long to complete.
Much like Mr. Meeseeks, the nature of a Spot instance is to appear, accomplish a task, and then vanish. Spot instances do not want to live forever; the longer that you keep Spot alive, the higher the risk of interruption. In other words, the faster you complete your task, the better. If you don’t wait to be terminated but jump in and jump out of your own accord, you’ll greatly reduce your involuntary interruptions.
This is where Data Science and Machine Learning can help. Trying to predict whether your Spot instance is going to end is a fundamentally flawed approach — we know the instance will end. What you need to know instead is how long it will be before the Spot instance dies.
This information can help you to dramatically decrease the amount of downtime your workload is subjected to. The strategy involves (1) running your workload on a diverse set of instances, (2) gracefully terminating each one before the likely end of its lifecycle, and (3) replacing terminated instances with new Spot instances.
Now, I’d like to share some real Spot data with you. This table compares Spot instances that were terminated by AWS with Spot instances that were purposely and gracefully terminated by users of Spot in 2024.
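As a rough illustration of step (2), one simple heuristic is to pick a refresh deadline from a low percentile of lifetimes observed before interruption, so that most instances are replaced before they reach the age at which interruptions start to pile up. This is a deliberately naive sketch: the lifetimes are made-up inputs, the percentile is an arbitrary choice, and it is not the Machine Learning approach described in this article.

```python
import statistics


def refresh_deadline(interrupted_lifetimes_min, percentile: int = 10) -> float:
    """Age (in minutes) below which about `percentile`% of observed interruptions fell.

    `interrupted_lifetimes_min` is a hypothetical sample of how long instances
    in one pool lived before AWS interrupted them (at least two values).
    """
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(interrupted_lifetimes_min, n=100, method="inclusive")
    return cuts[percentile - 1]


def should_refresh(age_min: float, deadline_min: float) -> bool:
    """True once an instance is old enough to be gracefully replaced."""
    return age_min >= deadline_min


lifetimes = list(range(0, 101))  # made-up sample: 0, 1, ..., 100 minutes
deadline = refresh_deadline(lifetimes, percentile=10)
print(deadline, should_refresh(25, deadline))
```

A lower percentile means fewer involuntary interruptions but more frequent replacements, so the right value depends on how expensive a restart is for your workload.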
Region | Terminated by AWS | Shut down by user | Avg lifetime (terminated by AWS) | Avg lifetime (shut down by user) | % difference in avg lifetime (shut down by user vs. terminated by AWS) |
us-east-1 | 308,450 | 6,095,872 | 398 min | 87 min | -78% |
us-west-2 | 201,543 | 4,009,879 | 566 min | 80 min | -86% |
us-east-2 | 24,572 | 293,438 | 933 min | 356 min | -62% |
eu-central-1 | 14,869 | 98,255 | 447 min | 184 min | -59% |
eu-west-1 | 4,771 | 2,215,149 | 1630 min | 33 min | -98% |
eu-west-2 | 3,616 | 38,789 | 610 min | 256 min | -58% |
As you can see, the lifetimes of Spot instances that were gracefully terminated were significantly shorter on average than the lifetimes of Spot instances that were involuntarily interrupted by AWS.
The longer you run on Spot, the higher the chance of interruption by AWS. Gracefully terminating instances early drastically reduces the rate of Spot interruptions, making it safer and more reliable to run on Spot.
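For readers who want to check the last column of the table, the percentage is the average lifetime of user-shutdown instances relative to that of AWS-terminated instances. The snippet below recomputes it from the lifetimes shown; the function name is just for illustration.

```python
# (avg lifetime terminated by AWS, avg lifetime shut down by user), in minutes,
# copied from the table above.
lifetimes = {
    "us-east-1": (398, 87),
    "us-west-2": (566, 80),
    "us-east-2": (933, 356),
    "eu-central-1": (447, 184),
    "eu-west-1": (1630, 33),
    "eu-west-2": (610, 256),
}


def pct_difference(terminated_min: float, shutdown_min: float) -> int:
    """Relative difference of user-shutdown lifetime vs. AWS-terminated lifetime."""
    return round((shutdown_min - terminated_min) / terminated_min * 100)


for region, (terminated, shutdown) in lifetimes.items():
    print(f"{region}: {pct_difference(terminated, shutdown)}%")
```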
Running on Spot is easier and safer with nOps
If you’re looking to take advantage of Spot discounts, increase reliability, and reduce management overhead, nOps can help.
We analyze massive amounts of proprietary Spot market and historical data with Machine Learning to predict how long Spot instances will live with a high degree of accuracy. In fact, our SLAs are equivalent to AWS’s On-Demand SLAs.
nOps Compute Copilot continually analyzes the Spot market, the requirements of your workload, and your usage patterns to identify a personalized set of Spot instances that will be optimally reliable for your workloads and your workloads only. Copilot then automatically and continually moves your workloads onto diverse and less risky instance types to drastically reduce the number of involuntary interruptions that occur, making it orders of magnitude easier and more reliable to use Spot.
With nOps, there’s no need to know instance types, monitor the Spot market, or manually manage workloads. We do it all for you.
Here are the key benefits:
- Effortless cost savings. Engineered to consider the most diverse variety of instance families suited to your workload, Copilot continually moves your workloads onto the safest, most cost-effective instances available without manual intervention.
- Awareness of your commitments. Copilot analyzes all of your commitments across your infrastructure to find the cost-optimal blend of Reserved Instances (RI), Savings Plans (SP), and Spot, so you get effortless discounts on all of your compute.
- No vendor lock-in. Just plug in your preferred AWS-native service (EC2 ASG, EC2 for Batch, EKS with Karpenter or Cluster Autoscaler…) to start saving effortlessly, and change your mind at any time.
- No upfront cost. You pay only a percentage of your realized savings, making adoption risk-free.
Join our customers using nOps to cut cloud costs and leverage automation with complete confidence by booking a demo today!