Shivansh Vij, Founder & CEO
Published Dec 27, 2024
We've solved the fundamental challenges that previously made spot instances impractical for most workloads. Architect enables any application to run on spot instances with zero downtime and zero data loss during preemption - all without any code modifications. This immediately enables 75%+ cost savings across your infrastructure while maintaining production-grade reliability.
Want to see it in action? Try Architect for free with GitHub Actions and watch your CI/CD workflows run uninterrupted on spot instances within your own cloud account.
For years, spot instances have promised dramatic cost savings for compute-heavy workloads (often being 75%+ cheaper than on-demand instances) - but they come with a critical caveat: they can be reclaimed by the cloud provider at any moment.
This has made them suitable only for use with stateless workloads or batch jobs that are designed to handle random interruptions, but impractical for anything else.
The result is an impossible situation: the applications that would benefit most from cost savings - those running 24/7 with high compute costs - are exactly the ones that can’t risk sudden node interruptions, and therefore cannot take advantage of spot instances.
That changes today.
Cloud spot instances are unused capacity available at steep discounts (more than 75% in most cases). They’re available across all major cloud providers and can be configured exactly like regular on-demand VMs - you select an instance type, configure networking, attach storage, and deploy your applications.
They don’t require any up-front monthly commits, and even have the same performance as on-demand instances - so what’s the catch?
Unfortunately, cloud providers can terminate and reclaim spot instances (known as preemption) with very little notice, typically between 30 seconds and 2 minutes depending on the provider, whenever capacity is needed elsewhere. To make matters worse, the probability of being preempted can also change dramatically - swinging from 3% to over 60% depending on the time of day, week, region, and instance type. This volatility makes it nearly impossible to predict or mitigate the preemption risk through traditional means like overprovisioning or autoscaling.
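To make the preemption notice concrete, here is a minimal sketch of how a process on an AWS spot instance could detect a pending interruption by polling the instance metadata service. It uses AWS only as an example, since other providers expose different endpoints. The function names (`parse_instance_action`, `seconds_remaining`, `poll_once`) are hypothetical and are not part of Architect.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Link-local address of the EC2 instance metadata service (IMDS).
IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str):
    """Parse the JSON served at spot/instance-action into (action, time)."""
    doc = json.loads(body)
    # The time field is a UTC timestamp such as 2024-12-27T08:22:00Z.
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], when.replace(tzinfo=timezone.utc)


def seconds_remaining(when: datetime, now: datetime) -> float:
    """How much of the preemption window is left (never negative)."""
    return max(0.0, (when - now).total_seconds())


def poll_once(timeout: float = 1.0):
    """Return (action, time) if an interruption is scheduled, else None."""
    # IMDSv2 requires a short-lived session token before any read.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption currently scheduled
            return None
        raise
```

`poll_once` only works from inside an EC2 instance. Whatever runs in response to the notice has to finish before `seconds_remaining` reaches zero.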
All of this results in a fascinating paradox: while spot instances could theoretically save organizations enormous amounts on their infrastructure costs, they're too volatile to use for anything that matters.
The applications that would benefit most from spot instances - stateful workloads like databases, ML training jobs, and game servers - are precisely the ones that can't tolerate random interruptions. The potential cost savings are tantalizing, but the operational risks are too high.
For some workloads you can try to work around these limitations, but it requires a significant amount of extra work. The traditional solution to this problem revolves around checkpointing - periodically saving application state to persistent storage, and when preemption occurs, you restore from the last checkpoint on a new instance. But this approach has some fundamental flaws:
These limitations have effectively confined spot instances to a narrow set of use cases - primarily stateless workloads and batch jobs that can handle interruptions. The promise of spot instances has remained largely unrealized.
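For reference, the traditional checkpoint-and-restore pattern described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the "application" is a simple counter loop, the state is a small dict, and `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names. The `die_at` parameter simulates a preemption.

```python
import os
import pickle
import tempfile
from typing import Optional


def save_checkpoint(state: dict, path: str) -> None:
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path: str, steps: int, every: int, die_at: Optional[int] = None) -> dict:
    """Do `steps` units of work, checkpointing after every `every` steps."""
    state = load_checkpoint(path)
    while state["step"] < steps:
        if die_at is not None and state["step"] == die_at:
            raise RuntimeError("preempted")  # simulated spot reclamation
        state["total"] += state["step"]  # the actual "work"
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    return state
```

In this sketch, if the process is interrupted at step 7 with a checkpoint interval of 4, the next run resumes from step 4 and repeats steps 4 through 6. The application loop also had to be written around the checkpoint calls in the first place.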
We decided to look at this problem from first principles - instead of trying to work around preemptions, we wanted to make them irrelevant altogether.
Architect enables any application to run on spot instances by automatically migrating them to new compute resources within the preemption window - with zero downtime, zero data loss, and zero modifications to your code.
This isn't just an incremental improvement - it's a complete paradigm shift in how we think about infrastructure reliability. Applications that previously required dedicated instances can now run on spot without compromise. Your databases, game servers, microservices, ML training jobs - everything just works.
Architect takes a fundamentally different approach to handling spot instance preemption. Rather than relying on periodic checkpoints, Architect continuously captures the complete state of your running application - including CPU, memory, and GPU state. This state is efficiently serialized and moved to a passive store like an EBS volume or S3 bucket, and is used during the live migration to reduce the amount of data we need to evacuate from the spot instance during the preemption window.
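To illustrate why continuous capture shrinks the work left for the preemption window, here is a toy model of incremental state syncing. It is a sketch of the general idea only, not Architect's implementation: this post does not describe how changes are tracked, so the page-granular dirty set, the `TrackedMemory` class, and the `sync` function are all assumptions made for illustration, and a plain dict stands in for the passive store.

```python
class TrackedMemory:
    """Toy page-granular memory that records which pages changed since the last sync."""

    def __init__(self, num_pages: int, page_size: int = 4096) -> None:
        self.pages = [bytes(page_size) for _ in range(num_pages)]
        # Nothing has been synced yet, so every page starts out dirty.
        self.dirty = set(range(num_pages))

    def write(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self.dirty.add(page)


def sync(mem: TrackedMemory, store: dict) -> int:
    """Copy only the dirty pages to the store and return the bytes transferred."""
    sent = 0
    for page in sorted(mem.dirty):
        store[page] = mem.pages[page]
        sent += len(mem.pages[page])
    mem.dirty.clear()
    return sent


mem = TrackedMemory(num_pages=1000)
store = {}

# Background capture while the instance is healthy: the full state moves once.
background_bytes = sync(mem, store)

# The application keeps running and touches a couple of pages.
mem.write(3, b"x" * 4096)
mem.write(7, b"y" * 4096)

# Preemption notice arrives: only the pages changed since the last sync remain.
final_bytes = sync(mem, store)
```

In this model the final sync moves 2 pages instead of 1,000, so the amount of data that must leave the instance under time pressure depends on how recently the last sync ran rather than on the total size of the state.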
When a preemption does occur, here's what happens:
This process happens automatically and transparently to your application. There are:
Your application simply continues running as if nothing happened. This opens up spot instances to be used for any workload without requiring modifications to application code. You can automate movement of a workload in production, without downtime and with a guarantee of its integrity.
While Architect’s immediate benefit is the ability to safely run any workload on spot instances, the implications of this technology go far deeper. This technology effectively decouples applications from their underlying infrastructure, enabling capabilities that were previously impossible:
These capabilities hint at a future where compute truly becomes a commodity.
Over the next few weeks we’ll be digging deeper into how we solved these challenges, as well as how Architect is implemented. If you’d like to stay up to date and learn more, sign up for our newsletter and we’ll let you know when we post new content.
We're initially launching Architect with support for GitHub Actions, allowing you to reliably run your CI/CD workflows on spot instances in your own cloud account. This service automatically handles migration during preemption events, ensuring your builds complete without interruption.
You can sign up for early access to see Architect in action and start saving on your CI costs today. Just add a few simple modifications to your GitHub repository, and we'll handle the rest.