Rethinking Spot Instances - Solving Preemption

TL;DR:

We've made spot instances practical for all Kubernetes workloads. Architect migrates pods to new nodes before spot instances are reclaimed - with zero downtime, zero client interruptions, and zero code changes. Get 75%+ cost savings on your entire Kubernetes infrastructure while maintaining production reliability.

Join our waitlist to run your production workloads on spot instances without any compromises.

The Spot Instance Problem

Spot instances are 75%+ cheaper than on-demand instances, but the cloud provider can reclaim them with only 30 seconds to 2 minutes of notice. This makes them impractical for most production Kubernetes workloads.

When a spot instance is reclaimed, every pod on that node experiences:

  • Forced termination
  • Rescheduling to a new node (if capacity exists)
  • Full application restart and initialization
  • Load balancer re-registration
  • Client request failures during the transition

Even "stateless" workloads aren't immune. Your users experience errors, timeouts, and degraded performance every time a node is preempted. The unpredictability makes it worse - preemption rates can swing from 3% to over 60% based on time, region, and instance type.

The result? Teams keep paying full price for on-demand instances because reliability matters more than cost savings.

Why Traditional Approaches Fail

Kubernetes provides mechanisms to handle node failures - pod disruption budgets, graceful termination, and cluster autoscaling. But none of these prevent service interruption during spot instance preemption.

Common workarounds all have critical flaws:

  • Keep extra replicas: Defeats the cost savings
  • Fast autoscaling: Still causes client interruptions during pod startup
  • Graceful shutdown handlers: Can't prevent the pod from being terminated
  • Mixed node pools: Reduces but doesn't eliminate the problem

You're still forced to choose between cost and reliability.
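
To make the limit of graceful shutdown handlers concrete, here is a minimal Python sketch of one. The names `shutdown_requested` and `process` are illustrative only and are not part of Kubernetes or Architect. The handler can stop the process from picking up new work, but it cannot veto termination: the kubelet still sends SIGKILL once the pod's termination grace period expires, and on a reclaimed spot node the machine disappears regardless.

```python
import signal
import threading

# Set once SIGTERM arrives (for example, when the kubelet begins terminating the pod).
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # All a handler can do is note the request and let the process clean up.
    # It cannot prevent the SIGKILL that follows the grace period.
    shutdown_requested.set()


# Must be called from the main thread of the process.
signal.signal(signal.SIGTERM, handle_sigterm)


def process(queue):
    """Handle queued items until shutdown is requested.

    Returns (done, remaining). Anything in `remaining` is work the pod
    never got to, which is lost unless another replica picks it up.
    """
    done = []
    while queue and not shutdown_requested.is_set():
        done.append(queue.pop(0))
    return done, queue
```

Once `shutdown_requested` is set, `process` returns immediately with the untouched items still in the queue. That is the gap the workarounds above leave open: the shutdown is orderly, but the work and the client connections still have to go somewhere else.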

Enter Architect: Seamless Pod Migrations

We took a different approach. Instead of managing the chaos of preemption, we eliminate it entirely.

Architect continuously snapshots your running pods without interrupting them, so it always has the complete state of your application on hand before anything happens. When a spot instance preemption notice arrives, we migrate your pods to new nodes before the original instance terminates. Your applications keep running, connections stay alive, and clients never notice.

This isn't a restart or a reschedule - it's true live migration for any Kubernetes workload.

How It Works

Under the hood, Architect is a relatively simple state machine whose only job is to keep your application running under any circumstances. Architect moves through four possible states:

  1. Normal Operation: In this state Architect is continuously capturing pod state and syncing it across nodes. This is why our migrations feel instantaneous - most of the migration work is amortized ahead of time. During this time we're also carefully watching resource utilization across the cluster to make sure we have enough resources to handle any sudden migrations that come up.

  2. Preemption: When a preemption notice from the cloud provider arrives (between 30 seconds and 2 minutes ahead of termination, depending on the provider), Architect goes to work moving the pods. We mostly rely on the Kubernetes scheduler for this part because we've already been coordinating with the scheduler in advance by way of ballast pods and DaemonSets. This allows us to be confident that there are enough resources in the correct places to not require a node scale-up.

  3. Migrations: This is where Architect earns its keep. Once the pods themselves have been rescheduled and all ancillary resources have been recreated (file descriptors, PVC attachments, GPU resources, etc.), it's time to move the state. This is close to instantaneous thanks to all the work we did ahead of time.

  4. Rerouting: Migrations aren't just supposed to be seamless for the application - they should be seamless for clients as well. To achieve this, Architect updates our XDP-based routing layer to transparently reroute connections. This is done so that no packets are dropped and the client can't even tell that the workload has migrated.
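
The four states above can be sketched as a small transition table. This is an illustrative Python model, not Architect's implementation: the event names (`PREEMPTION_NOTICE`, `PODS_RESCHEDULED`, `STATE_RESTORED`, `CONNECTIONS_REROUTED`) are hypothetical, as is the assumption that the machine returns to Normal Operation once rerouting finishes.

```python
from enum import Enum, auto


class State(Enum):
    NORMAL_OPERATION = auto()
    PREEMPTION = auto()
    MIGRATION = auto()
    REROUTING = auto()


class Event(Enum):
    # Hypothetical event names, chosen for illustration only.
    PREEMPTION_NOTICE = auto()
    PODS_RESCHEDULED = auto()
    STATE_RESTORED = auto()
    CONNECTIONS_REROUTED = auto()


# Each (state, event) pair maps to exactly one next state.
TRANSITIONS = {
    (State.NORMAL_OPERATION, Event.PREEMPTION_NOTICE): State.PREEMPTION,
    (State.PREEMPTION, Event.PODS_RESCHEDULED): State.MIGRATION,
    (State.MIGRATION, Event.STATE_RESTORED): State.REROUTING,
    # Assumption: after rerouting, the machine resumes normal operation.
    (State.REROUTING, Event.CONNECTIONS_REROUTED): State.NORMAL_OPERATION,
}


def step(state, event):
    """Return the next state, or raise ValueError for an invalid transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event.name} is not valid in state {state.name}"
        ) from None


def run(events, state=State.NORMAL_OPERATION):
    """Apply events in order and return every state visited, including the start."""
    history = [state]
    for event in events:
        state = step(state, event)
        history.append(state)
    return history
```

Feeding the four events in order through `run` visits Preemption, Migration, and Rerouting before landing back in Normal Operation. Any out-of-order event raises `ValueError`, which keeps the model strict about what can happen in each state.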

The end result is a completely seamless migration experience - neither your clients nor your application can tell that it's now running on a completely new node. The best part is that this entire process requires no code changes: just run helm install and your infrastructure is spot-ready.

See It In Action: Live Demo at KubeCon NA 2024

This isn't theoretical. At KubeCon North America 2024, we demonstrated live pod migration across multiple cloud providers (AWS, GCP, and Azure), all while maintaining active client connections.

Inherently Compatible

Architect is inherently compatible with any type of workload and fixes how your infrastructure handles spot preemptions and pod migrations:

  • Web Services & APIs: No 502 errors, no request timeouts, no user impact during preemptions
  • Databases & Caches: Maintain connections, preserve buffers, no rebalancing overhead
  • ML/AI Workloads: Models stay loaded, GPU state is preserved, training continues uninterrupted
  • Streaming Systems: Kafka consumers keep their positions, WebSocket connections survive
  • Batch Jobs: Complete without restart, no lost progress, no wasted compute

With Architect, every pod in your cluster can now safely run on spot instances.

The Power of Workload Mobility

The same technology that enables spot instance migration also powers:

  • Scale-to-Zero Without Cold Starts: Hibernate idle pods, wake them in <50ms
  • Node Maintenance Without Downtime: Evacuate nodes anytime without service impact
  • Dynamic Cost Optimization: Move workloads to the cheapest compute continuously and automatically

Only when your pods can move without interruption does your infrastructure become truly elastic.

Getting Started

Architect is currently in early access. We're working with select teams to dramatically reduce their Kubernetes costs without compromising reliability.

Join our waitlist to be among the first to deploy Architect in your cluster. We'll help you identify which workloads can benefit most and guide you through maximizing your spot instance savings.

Spot instances have always been a compromise - great prices but unreliable for production. With Architect, that compromise finally disappears.

Author: Shivansh Vij, Founder & CEO