Rethinking Spot Instances - How We Solved the Preemption Problem

shivansh vij, founder & CEO

Published Dec 27, 2024

TL;DR:

We've solved the fundamental challenges that previously made spot instances impractical for most workloads. Architect enables any application to run on spot instances with zero downtime and zero data loss during preemption - all without any code modifications. This immediately enables 75%+ cost savings across your infrastructure while maintaining production-grade reliability.

Want to see it in action? Try Architect for free with GitHub Actions and watch your CI/CD workflows run uninterrupted on spot instances within your own cloud account.

Rethinking Spot Instances

For years, spot instances have promised dramatic cost savings for compute-heavy workloads (often 75% or more below on-demand pricing) - but they come with a critical caveat: they can be reclaimed by the cloud provider at any moment.

That caveat has made them suitable only for stateless workloads or batch jobs designed to handle random interruptions, and impractical for anything else.

The result is an impossible situation. The applications that would benefit most from cost savings - those running 24/7 with high compute costs - are exactly the ones that can't risk sudden node interruptions, and therefore cannot take advantage of spot instances.

That changes today.

The Spot Instance Paradox

Cloud spot instances are unused capacity available at steep discounts (more than 75% in most cases). They’re available across all major cloud providers and can be configured exactly like regular on-demand VMs - you select an instance type, configure networking, attach storage, and deploy your applications.

They don’t require any up-front monthly commits, and even have the same performance as on-demand instances - so what’s the catch?

Unfortunately, cloud providers can terminate and reclaim spot instances (known as preemption) on very short notice whenever capacity is needed elsewhere - typically two minutes on AWS and about 30 seconds on Google Cloud and Azure. To make matters worse, the probability of being preempted can also change dramatically - swinging from 3% to over 60% depending on the time of day, week, region, and instance type. This volatility makes it nearly impossible to predict or mitigate the preemption risk through traditional means like overprovisioning or autoscaling.

All of this results in a fascinating paradox: while spot instances could theoretically save organizations enormous amounts on their infrastructure costs, they're too volatile to use for anything that matters.

The applications that would benefit most from spot instances - stateful workloads like databases, ML training jobs, and game servers - are precisely the ones that can't tolerate random interruptions. The potential cost savings are tantalizing, but the operational risks are too high.

The Traditional Approach: Why Checkpointing Isn't Enough

For some workloads you can try to work around these limitations, but it requires a significant amount of extra work. The traditional solution revolves around checkpointing: you periodically save application state to persistent storage, and when a preemption occurs, you restore from the last checkpoint on a new instance. But this approach has some fundamental flaws:

  • Incomplete State: You lose all work since the last checkpoint
  • Application Complexity: Your code needs to implement checkpoint/restore logic
  • Time Constraints: The restoration process often takes longer than the preemption notice window
  • Service Disruption: Network connections drop during migration
  • State Consistency: Managing checkpoints becomes a nightmare at scale
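
To make the first two flaws concrete, here is a minimal sketch of application-level checkpointing in Python. It is a toy job, not Architect code: the function names, the JSON checkpoint file, and the interval of 100 steps are all assumptions chosen for illustration. The job is "preempted" at step 250, and on restart it can only resume from step 200, so 50 steps of work are repeated.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100  # steps between checkpoints (illustrative value)


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, until_step, preempt_at=None):
    """Run the job; stop abruptly at preempt_at to simulate a reclaimed node."""
    state = load_checkpoint(path)
    while state["step"] < until_step:
        if preempt_at is not None and state["step"] == preempt_at:
            return state["step"]  # in-memory progress is lost here
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["step"]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        reached = run(path, until_step=1000, preempt_at=250)
        resumed_from = load_checkpoint(path)["step"]
        lost = reached - resumed_from
        run(path, until_step=1000)  # restart on a "new instance"
        return reached, resumed_from, lost, load_checkpoint(path)["total"]
```

Calling `demo()` shows the job reaching step 250, resuming from step 200, and redoing the 50 lost steps before finishing. Note that the checkpoint and restore logic lives inside the application itself, which is exactly the complexity the list above describes.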

These limitations have effectively confined spot instances to a narrow set of use cases - primarily stateless workloads and batch jobs that can handle interruptions. The promise of spot instances has remained largely unrealized.

Enter Architect: Making Preemption Irrelevant

We decided to look at this problem from first principles - instead of trying to work around preemptions, we wanted to make them irrelevant altogether.

Architect enables any application to run on spot instances by automatically migrating them to new compute resources within the preemption window - with zero downtime, zero data loss, and zero modifications to your code. 

This isn't just an incremental improvement - it's a complete paradigm shift in how we think about infrastructure reliability. Applications that previously required dedicated instances can now run on spot without compromise. Your databases, game servers, microservices, ML training jobs - everything just works.

How It Works

Architect takes a fundamentally different approach to handling spot instance preemption. Rather than relying on periodic checkpoints, Architect continuously captures the complete state of your running application - including CPU, memory, and GPU state. This state is serialized and written to a passive store such as an EBS volume or an S3 bucket. During live migration, that stored copy reduces the amount of data that must be evacuated from the spot instance within the preemption window.
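
Why does shipping state ahead of time matter? A rough sketch in Python, under stated assumptions: the `TrackedMemory` class, the dictionary standing in for the passive store, and the bandwidth figures are hypothetical and say nothing about Architect's actual implementation. The idea is only that if most pages are already in the store, the data left to move during the notice window is the set of pages changed since the last sync.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # bytes per page (a common page size; an assumption here)


@dataclass
class TrackedMemory:
    """Toy memory image that remembers which pages changed since the last sync."""
    pages: dict = field(default_factory=dict)  # page number -> bytes
    dirty: set = field(default_factory=set)    # pages written since last sync

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def sync_to(self, store):
        """Copy only the dirty pages to the passive store; return bytes sent."""
        sent = 0
        for page_no in sorted(self.dirty):
            store[page_no] = self.pages[page_no]
            sent += len(self.pages[page_no])
        self.dirty.clear()
        return sent


def transfer_seconds(num_bytes, gbit_per_s):
    """Idealized time to move num_bytes over a link, ignoring protocol overhead."""
    return num_bytes * 8 / (gbit_per_s * 1e9)


def demo():
    mem, store = TrackedMemory(), {}
    for n in range(1000):
        mem.write(n, bytes(PAGE_SIZE))
    first = mem.sync_to(store)       # background sync: everything is new
    for n in range(10):
        mem.write(n, b"\x01" * PAGE_SIZE)
    second = mem.sync_to(store)      # at preemption: only the changed pages
    return first, second
```

With these illustrative numbers, `transfer_seconds(64 * 2**30, 10)` is roughly 55 seconds for a full 64 GiB image on a 10 Gbit/s link, longer than a 30-second notice, while `transfer_seconds(2 * 2**30, 10)` is under 2 seconds for a 2 GiB delta.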

When a preemption does occur, here's what happens:

  1. Architect receives the preemption notice from the cloud provider
  2. Available target compute resources are identified and acquired (this could be additional spot instances or other types such as reserved or on-demand)
  3. Architect begins streaming the application's live state to the new instance, stitching together the previously captured state from the passive store with any live or in-progress changes
  4. Architect automatically reroutes network traffic to the new resource without dropping connections
  5. The entire migration completes before the preemption deadline
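
The numbered steps above can be read as a sequence that has to fit inside a time budget. The Python sketch below is an assumption-laden illustration, not Architect's code: it parses a notice in the JSON shape that AWS exposes through the EC2 instance metadata `spot/instance-action` document, and the step names and durations passed to `plan_migration` are made-up placeholders.

```python
import json
from datetime import datetime, timezone


def seconds_until_preemption(notice_json, now):
    """Parse an EC2-style instance-action document and return the seconds left."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def plan_migration(budget_s, steps):
    """Walk the steps in order and check that they fit inside the notice window.

    steps is a list of (name, estimated_seconds) pairs. The estimates are
    placeholders for illustration, not measurements.
    """
    elapsed = 0.0
    timeline = []
    for name, cost in steps:
        elapsed += cost
        if elapsed > budget_s:
            raise TimeoutError(f"{name} would finish after the deadline")
        timeline.append((name, elapsed))
    return timeline


# Example notice in the shape documented by AWS; the timestamp is invented.
NOTICE = '{"action": "terminate", "time": "2024-12-27T12:02:00Z"}'
NOW = datetime(2024, 12, 27, 12, 0, 0, tzinfo=timezone.utc)

budget = seconds_until_preemption(NOTICE, NOW)
timeline = plan_migration(
    budget,
    [
        ("acquire target instance", 40.0),  # hypothetical duration
        ("stream remaining state", 30.0),   # hypothetical duration
        ("reroute network traffic", 5.0),   # hypothetical duration
    ],
)
```

If any step pushed the cumulative time past the budget, `plan_migration` would raise `TimeoutError`, which mirrors the constraint in step 5: everything has to finish before the deadline.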

This process happens automatically and transparently to your application. There are:

  • No code changes required
  • No checkpoints to manage
  • No data loss
  • No service interruptions
  • No dropped connections

Your application simply continues running as if nothing happened. This opens up spot instances to be used for any workload without requiring modifications to application code. You can automate movement of a workload in production, without downtime and with a guarantee of its integrity.

Beyond Cost Savings

While Architect's immediate benefit is the ability to safely run any workload on spot instances, the implications go far deeper. The technology effectively decouples applications from their underlying infrastructure, enabling capabilities that were previously impossible:

  • Seamless multi-cloud deployments
  • Zero-downtime hardware upgrades
  • Geographic optimization for latency
  • True infrastructure portability

These capabilities hint at a future where compute truly becomes a commodity.

Over the next few weeks we'll be digging deeper into how we solved these challenges, as well as how Architect is implemented. If you'd like to stay up to date and learn more, sign up for our newsletter and we'll let you know when we post new content.

Getting Started

We're initially launching Architect with support for GitHub Actions, allowing you to reliably run your CI/CD workflows on spot instances in your own cloud account. This service automatically handles migration during preemption events, ensuring your builds complete without interruption.

You can sign up for early access to see Architect in action and start saving on your CI costs today. Just add a few simple modifications to your GitHub repository, and we'll handle the rest.
