Scale Kubernetes Pods to Zero, Without Cold Starts

The Cold Start Problem

Imagine Kubernetes pods being turned on and off as easily as flipping the light switch when you enter or leave a room. No scheduling latency or startup delay, just instantly on, like the lights.

Many discussions make it appear as if scaling pods to zero is a solved problem that operates as simply as a light switch. This is misleading at best. Sure, Kubernetes has had the primitives for years, and vendors have been making promises for just as long. You set up an autoscaler, traffic drops, pods disappear, your bill shrinks. Simple, right?

Except it is not simple, because in many cases scale-to-zero has a dirty secret: the moment traffic returns, you are starting an application from scratch. Your pods have been deleted. Kubernetes has to reschedule them onto a node, pull the container image, run init containers, wait for the application runtime to initialize or the database cache to warm up, and pass a health check before a single request gets through.

For a Go service, that might be a few seconds. For a database or a JVM-based application like Spring Boot or Kafka, you are looking at a minute or more.

The experience of flipping a light switch and waiting 30+ seconds for the lights to come on? That is the cold start of a Kubernetes pod, and it drives all sorts of architectural and efficiency anti-patterns.

That gap between zero and ready is the cold start, and it is the real cost of conventional scale-to-zero. For plenty of workloads the cost and complexity are high enough that teams just leave pods running.

They end up paying for idle capacity around the clock to avoid the user-facing latency that comes with every wake cycle.

Architect was built around a different premise: pods should go to zero, but they should never cold start.

What Actually Causes Cold Starts

When most scale-to-zero tools bring a pod back, they recreate it from nothing. The scheduler finds a node, the runtime pulls the image, the application boots, and only then does traffic flow. This is not a configuration problem. It is a consequence of the underlying model: the pod was deleted, and something new has to be created in its place.

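The sequence looks like this: the scheduler places the pod on a node, the kubelet pulls the container image, init containers run to completion, the application boots, and the health check passes. Only then does the first request get through.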
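
As a rough illustration, those stages map onto an ordinary pod spec as shown in the sketch below. The image names, the `db` hostname, the `/healthz` path, and the probe timings are assumptions made for the example, not values from any particular workload.

```yaml
# Illustrative pod: each numbered comment marks a stage that adds to cold start time.
apiVersion: v1
kind: Pod
metadata:
  name: my-service
spec:
  # 1. Scheduling: the scheduler must find a node that fits the requests below.
  # 2. Image pull: the kubelet pulls any image not already cached on that node.
  initContainers:
    # 3. Init containers run to completion, one at a time, before the app starts.
    - name: wait-for-db
      image: busybox:1.28
      command: ["sh", "-c", "until nslookup db; do sleep 1; done"]
  containers:
    - name: my-service
      image: my-service:latest
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
      # 4. Application boot: runtime startup, cache warming, connection pools.
      # 5. Readiness: the pod receives no Service traffic until this probe succeeds.
      readinessProbe:
        httpGet:
          path: /healthz   # hypothetical health endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
```

Every stage has to finish before the next begins, which is why the delays add up rather than overlap.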

Some tooling tries to paper over this with techniques like pre-warming pools or keeping a "shadow" replica running. Those approaches trade cost for latency, and they still require you to predict demand ahead of time and provision accordingly. You are back to over-provisioning, just with extra steps.

The fundamental issue is that rescheduling is relatively expensive, and any approach that relies on it will have a cold start floor that application teams have to architect around (ba-dum-tss).

Hibernation, Not Deletion

Architect does not delete pods. It hibernates them.

When a pod has been idle for a configurable duration, Architect checkpoints the running container, preserves its execution state on local node storage, and reduces the pod's resource requests to zero. The pod remains registered with the Kubernetes API, stays attached to its Services, and keeps its PVCs mounted. The difference between a hibernated pod and a deleted pod is subtle but important: from Kubernetes' perspective the pod still exists, while from the node's perspective it is consuming no CPU or memory.

If this sounds like much better bin packing, that's because it is.

When traffic arrives, Architect restores the container from the checkpoint in under 50ms. The application picks up exactly where it left off. No image pull, no init sequence, no JVM warmup.

Packets that arrive during the wake cycle are buffered and delivered once the container is running. No requests are dropped. No timeout errors surface to the client.

How Architect Fits Into Your Cluster

Architect runs as a DaemonSet, so an Architect agent runs on each labeled node and handles the checkpoint, hibernate, and wake operations locally. A cluster-level control plane coordinates checkpoint locking during pod migrations, while the checkpoint data itself transfers daemon-to-daemon. You can learn more about the Architect components in the How It Works section of our docs.

Because there are no external analytics platforms and no separate optimization layer making decisions about your cluster, running Architect is simple and well integrated into native Kubernetes operational patterns and tooling. Hibernation and wake are handled by architectd on the node where the pod lives. The decision logic is: if the pod has been idle for N seconds, hibernate it; if traffic arrives, wake it. You set the threshold. Architect executes it.

For a detailed view of configuration options, take a look at the configuration or examples sections of our documentation.

No Code Changes Required

Enabling Architect on a workload is a two-line change to your pod template: one annotation and one runtimeClassName. There is no SDK to import, no sidecar to configure, no application-level checkpoint logic to write.

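apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  # A selector and matching labels are required for a valid Deployment;
  # the app label used here is illustrative.
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
      annotations:
        # How long to wait after the last observed activity before hibernating.
        architect.loopholelabs.io/scale-down-timeout: "60s"
    spec:
      # Run the container through Architect's shim to enable checkpoint and restore.
      runtimeClassName: runc-architect
      containers:
        - name: my-service
          image: my-service:latest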

Two things to note here. runtimeClassName: runc-architect tells the node to run this container through Architect's shim, which is what enables checkpoint and restore. The scale-down-timeout annotation sets how long Architect waits after the last observed activity before hibernating. Both are pod-spec concerns. Your application code is untouched.

If you are running StatefulSets, the same annotation and runtime class apply. PVCs stay mounted through the hibernation cycle, so your stateful workloads wake with their storage intact.
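
A minimal sketch of what that might look like on a StatefulSet is shown below. Only the annotation and `runtimeClassName` come from the Deployment example above; the workload name, image, mount path, and storage size are placeholders chosen for illustration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db               # placeholder name
spec:
  serviceName: my-db
  replicas: 1
  selector:
    matchLabels:
      app: my-db
  template:
    metadata:
      labels:
        app: my-db
      annotations:
        # Same annotation as on the Deployment.
        architect.loopholelabs.io/scale-down-timeout: "60s"
    spec:
      # Same runtime class as on the Deployment.
      runtimeClassName: runc-architect
      containers:
        - name: my-db
          image: my-db:latest          # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data # placeholder path
  volumeClaimTemplates:
    # The PVC created from this template is the storage that stays mounted
    # while the pod is hibernated.
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```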

For workloads where you want explicit control over checkpoint lifecycle, Architect exposes a PersistentCheckpoint CRD. This lets you name and manage checkpoints directly, which is useful for workflows where you want a known-good restore point rather than the most recent automatic checkpoint.

Use Case: Developer Environments

Most developer environments are a textbook idle resource problem. Your staging clusters, feature branch environments, and internal tooling are heavily used during the business day. Then the team goes home, and those pods run until morning or over the weekend, burning compute for nothing.

The conventional answer is to write a cron job or similar that scales deployments to zero at night and back up in the morning. This works, but the first engineer who opens a staging tab at 08:00 before the scale-up job runs gets a 60-second wait while Spring Boot remembers who it is. If that engineer is in a different timezone, or they work a non-standard schedule, the friction compounds quickly.
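
For reference, the scheduled scale-down described above is often implemented as a CronJob along these lines. This is a minimal sketch rather than a recommended setup: the job name, schedule, `staging-scaler` ServiceAccount, and kubectl image are all assumptions, and a matching scale-up job would be needed for the morning.

```yaml
# Conventional approach: scale a Deployment to zero every weekday evening.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: staging-scale-down
spec:
  # 20:00 Monday to Friday. Interpreted in the kube-controller-manager's
  # time zone unless spec.timeZone is set.
  schedule: "0 20 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          # Hypothetical ServiceAccount; it needs RBAC permission to
          # scale Deployments in this namespace.
          serviceAccountName: staging-scaler
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:latest   # assumption: any image that ships kubectl
              command: ["kubectl", "scale", "deployment/my-service", "--replicas=0"]
```

The schedule is fixed in advance, which is exactly the weakness: anyone working outside the assumed hours hits a deleted pod and waits for a full cold start.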

With Architect, the environments hibernate automatically when idle. When the first request hits in the morning, the pod wakes in under 50ms. The environment is ready before the browser tab finishes rendering.

The experience is closer to that light switch I mentioned earlier than it is to other scale-to-zero solutions, which rely on cold-starting pods.

This works across the full shift spectrum. A team distributed across UTC-8 to UTC+5 can have a single staging environment that is effectively always available and never burning idle compute. Engineers in later timezones do not pay a cold start tax because the environment went quiet while the earlier timezone slept.

The setup is the same two-line change. For a cluster of developer environments, you might pick an idle threshold that suits how the team works, for example five minutes:

annotations:
  architect.loopholelabs.io/scale-down-timeout: "300s"  # 5 minutes idle

Use Case: Spiky / Periodic Production Traffic

Retail holidays, ticket sales, and any platform with highly fluctuating sales patterns share a common characteristic: demand is predictable in aggregate but unpredictable in timing. You know a sale is coming. You do not know exactly when the load will spike, or how steep the initial ramp will be.

The conventional response is to pre-provision capacity before the event. You scale up replicas early, run them through the quiet period, and scale back down when you think the spike has passed. This is safe (if you know the demand) and expensive.

An alternative that does not work well is relying on reactive autoscaling to provision new pods during the spike. By the time the autoscaler detects the load increase, schedules new pods, and waits for them to pass health checks, you have already dropped requests or served degraded latency to the customers driving the most revenue in your quarter.

Architect gives you a third option. You can maintain a pool of hibernated replicas that are ready to wake on the first packet. The replicas exist in the cluster and are registered with your Services. Their resource cost at rest is zero. When the spike hits, they wake on demand without a scheduling or initialization delay.

Architect works alongside HPA without conflict. HPA manages how many replicas exist. Architect manages the resource cost of idle replicas. They operate at different layers and do not interfere with each other.

For a seasonal retail workload, a reasonable pattern is to scale replica count up before an anticipated event using HPA or manual scaling, let Architect hibernate the extra replicas, and then let the live traffic wake them on demand as the spike materializes. You get burst capacity without the cold start risk and without paying for active compute on standby pods.
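
One way to express the "scale replica count up before the event" step is to raise the HPA's replica floor, as in the sketch below. The replica counts and the CPU target are illustrative assumptions. How utilization metrics are computed for hibernated pods is not covered here, so treat the metric choice as a placeholder and check it against the Architect documentation.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  # Raised ahead of the event so the extra replicas already exist;
  # idle ones are left for Architect to hibernate.
  minReplicas: 10
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

After the event, lowering `minReplicas` again returns the replica count to its usual floor.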

TL;DR: Other Patterns Do Not Work at Scale

It is worth being direct about why some common approaches to scale-to-zero hit a ceiling.

Deletion-based scale-to-zero treats pods as disposable. Every wake cycle is a full pod creation: schedule, pull, init, boot. The cold start latency is baked into the architecture. For stateless microservices with fast boot times this is sometimes acceptable. For anything stateful, JVM-based, or with meaningful initialization logic, it is not.

Analytics-driven optimization platforms require you to instrument your cluster, pipe usage metrics into an external system, wait for the system to build a model of your workload patterns, and then act on the recommendations the platform surfaces. The operational overhead of these platforms is real and annoying. Perhaps more importantly, the decision logic often lives outside your cluster. When something goes wrong at 2 AM, you are debugging across two systems. These platforms can be useful for cost visibility and maybe even rightsizing, but they do not solve the cold start problem.

Pre-warming and shadow replicas keep a warm instance running to absorb the first request while additional capacity scales up behind it. This eliminates cold starts at the cost of eliminating scale-to-zero. You are always paying for at least one active replica.

Architect's model is more elegant: the pod never leaves, it just stops consuming resources. No rescheduling, no external system, no permanently warm replica. The checkpoint is the mechanism. The hibernated pod is the warm standby that costs nothing.

Ready for an Architect POC?

The full Quick Start walks through installation, node labeling, and deploying your own workloads with Architect annotations.

If you are evaluating Architect for a specific workload or want to talk through whether it fits your architecture, schedule a demo with the team.