A configuration change to our etcd backup system caused 23 minutes of control plane unavailability affecting 1,847 clusters. Running workloads were not affected. Here's what happened and what we're doing to prevent it.
On December 16, 2025, between 14:23 UTC and 14:46 UTC, the Kubernetes control plane was unavailable for managed clusters in our Virginia, Oregon, and Frankfurt regions. Customers could not create, modify, or delete Kubernetes resources during this window. Existing workloads continued running uninterrupted because the kubelet on worker nodes operates independently of the control plane for already-scheduled pods.
The root cause was a misconfigured etcd compaction parameter deployed as part of an unrelated backup system change. The parameter caused etcd to attempt compaction of the entire keyspace simultaneously rather than incrementally, exhausting available memory and causing OOM kills.
- **Duration:** 23 minutes
- **Affected:** 1,847 clusters
- **Regions:** Virginia, Oregon, Frankfurt
- **Running pods:** NOT affected
- **New deployments:** BLOCKED
- **API calls:** 503 errors
14:12 UTC — A configuration change to the etcd backup cronjob is deployed to production. The change was intended to enable incremental snapshots, reducing backup duration from 45 minutes to approximately 8 minutes. The change had been tested in staging for two weeks without issues.
14:23 UTC — The backup cronjob runs for the first time with the new configuration. A parameter that was correctly set in staging (`--auto-compaction-retention=1h`) was overridden by a stale value in the production Ansible playbook (`--auto-compaction-retention=0`), which etcd interprets as "compact everything immediately." All three regions' etcd nodes simultaneously spike from 8GB to 42GB memory usage and are OOM-killed.
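The drift looked roughly like this (the variable names and comments below are illustrative, not excerpts from our actual playbooks):

```yaml
# staging playbook -- updated during the backup refactor (illustrative)
etcd_auto_compaction_retention: "1h"   # incremental hourly compaction

# production playbook -- stale value copied from an older template (illustrative)
etcd_auto_compaction_retention: "0"    # in our setup, triggered full-keyspace compaction
```

Because the two environments rendered their etcd flags from different playbooks, nothing forced these values to agree.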
14:24 UTC — Monitoring alerts fire. On-call SRE acknowledges within 90 seconds. Auto-restart fails because the compaction parameter persists, causing immediate re-OOM on startup.
14:28 UTC — Root cause identified from etcd logs showing `compaction: start at revision 0`. Rollback configuration drafted.
14:33 UTC — Rollback deployed via emergency configuration push (bypasses normal CI/CD). etcd nodes begin restarting correctly.
14:38 UTC — Virginia and Oregon fully recovered. Frankfurt delayed — one etcd node requires manual data directory cleanup after corrupted WAL from mid-compaction OOM.
14:46 UTC — All regions recovered. Verified by automated health checks against 200 sample clusters.
The direct cause was a stale configuration value in our production Ansible playbook, copied from an older template during a refactor three months earlier. The staging environment used a different playbook that had been correctly updated. The deeper issue: our etcd configuration testing didn't validate against production-scale data volumes (38GB in production vs. 2GB in staging). At 2GB, a full compaction completes in under a second and never triggers an OOM.
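A minimal sketch of the kind of cross-environment check that would have caught this drift before deploy. The flag names are real etcd flags; the idea of extracting them into per-environment dicts (rather than rendering the full Ansible templates) is a simplifying assumption:

```python
# Compare etcd flags rendered for two environments and report any drift.
# In practice you would render the Ansible templates first; here we assume
# the flags have already been extracted into plain dicts.

def find_drift(staging: dict, production: dict) -> list:
    """Return (flag, staging_value, production_value) tuples that differ."""
    drift = []
    for flag in sorted(staging.keys() | production.keys()):
        s, p = staging.get(flag), production.get(flag)
        if s != p:
            drift.append((flag, s, p))
    return drift

staging_flags = {"--auto-compaction-retention": "1h", "--snapshot-count": "10000"}
production_flags = {"--auto-compaction-retention": "0", "--snapshot-count": "10000"}

for flag, s, p in find_drift(staging_flags, production_flags):
    print(f"DRIFT: {flag} staging={s} production={p}")
```

A check like this only proves the environments agree; it says nothing about whether the agreed value is safe, which is why it complements rather than replaces value validation.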
| Action | Status | Owner | Deadline |
|---|---|---|---|
| Correct production Ansible playbook | Completed | SRE | Dec 16 |
| Add compaction param to pre-flight validation | Completed | SRE | Dec 17 |
| etcd memory alerts at 60% threshold | In progress | Monitoring | Dec 30 |
| Canary deployment for etcd config changes | In progress | SRE | Jan 2 |
| Production-scale etcd test environment | Planned | Infra | Jan 31 |
| cgroup memory limits for etcd | Planned | SRE | Jan 31 |
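The pre-flight validation item above can be sketched as a check for known-dangerous flag values, run before any configuration push reaches production. The rule table here is illustrative, not our actual rule set:

```python
# Pre-flight check: reject etcd flag values known to be dangerous.
# The rules below are illustrative examples, not a complete policy.

DANGEROUS = {
    # flag: (predicate that returns True for a bad value, reason)
    "--auto-compaction-retention": (
        lambda v: v == "0",
        "0 triggered immediate full-keyspace compaction in our setup",
    ),
}

def validate_flags(flags: dict) -> list:
    """Return human-readable violations; an empty list means the config passes."""
    violations = []
    for flag, (is_bad, reason) in DANGEROUS.items():
        if flag in flags and is_bad(flags[flag]):
            violations.append(f"{flag}={flags[flag]}: {reason}")
    return violations

print(validate_flags({"--auto-compaction-retention": "0"}))   # one violation
print(validate_flags({"--auto-compaction-retention": "1h"}))  # []
```

Wiring a check like this into the deploy pipeline as a blocking step means a bad value fails the push instead of OOM-killing etcd at runtime.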
We take control plane availability seriously, and 23 minutes of downtime is not acceptable. This was a preventable configuration error, and the safeguards we're implementing should have existed before this incident. Affected customers will receive automatic SLA credits applied to their January invoice.