A configuration change to our etcd backup system caused 23 minutes of control plane unavailability affecting 1,847 clusters. Running workloads were not affected. Here's what happened and what we're doing to prevent it.
On December 16, 2025, between 14:23 UTC and 14:46 UTC, the Kubernetes control plane was unavailable for managed clusters in our Virginia, Oregon, and Frankfurt regions. Customers could not create, modify, or delete Kubernetes resources during this window. Existing workloads continued running uninterrupted because the kubelet on worker nodes operates independently of the control plane for already-scheduled pods.
The root cause was a misconfigured etcd compaction parameter deployed as part of an unrelated backup system change. The parameter caused etcd to attempt compaction of the entire keyspace simultaneously rather than incrementally, exhausting available memory and causing OOM kills.
- **Duration:** 23 minutes
- **Affected:** 1,847 clusters
- **Regions:** Virginia, Oregon, Frankfurt
- **Running pods:** NOT affected
- **New deployments:** BLOCKED
- **API calls:** 503 errors
14:12 UTC — A configuration change to the etcd backup cronjob is deployed to production. The change was intended to enable incremental snapshots, reducing backup duration from 45 minutes to approximately 8 minutes. The change had been tested in staging for two weeks without issues.
14:23 UTC — The backup cronjob runs for the first time with the new configuration. A parameter that was correctly set in staging (`--auto-compaction-retention=1h`) was overridden by a stale value in the production Ansible playbook (`--auto-compaction-retention=0`), which etcd interprets as "compact everything immediately." All three regions' etcd nodes simultaneously spike from 8GB to 42GB memory usage and are OOM-killed.
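The drift looked roughly like this (the variable names and comments below are illustrative, not excerpts from our actual playbooks):

```yaml
# staging playbook -- updated during the backup refactor (illustrative)
etcd_auto_compaction_retention: "1h"   # incremental hourly compaction

# production playbook -- stale value copied from an older template (illustrative)
etcd_auto_compaction_retention: "0"    # in our setup, triggered full-keyspace compaction
```

Because the two environments rendered their etcd flags from different playbooks, nothing forced these values to agree.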
14:24 UTC — Monitoring alerts fire. On-call SRE acknowledges within 90 seconds. Auto-restart fails because the compaction parameter persists, causing immediate re-OOM on startup.
14:28 UTC — Root cause identified from etcd logs showing `compaction: start at revision 0`. Rollback configuration drafted.
14:33 UTC — Rollback deployed via emergency configuration push (bypasses normal CI/CD). etcd nodes begin restarting correctly.
14:38 UTC — Virginia and Oregon fully recovered. Frankfurt delayed — one etcd node requires manual data directory cleanup after corrupted WAL from mid-compaction OOM.
14:46 UTC — All regions recovered. Verified by automated health checks against 200 sample clusters.
The direct cause was a stale configuration value in our production Ansible playbook, copied from an older template during a refactor three months earlier. The staging environment used a different playbook that had been correctly updated. The deeper issue: our etcd configuration testing didn't validate against production-scale data volumes (38GB in production vs. 2GB in staging). At 2GB, a full compaction completes in under a second and never triggers an OOM.
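A minimal sketch of the kind of cross-environment check that would have caught this drift before deploy. The flag names are real etcd flags; the idea of extracting them into per-environment dicts (rather than rendering the full Ansible templates) is a simplifying assumption:

```python
# Compare etcd flags rendered for two environments and report any drift.
# In practice you would render the Ansible templates first; here we assume
# the flags have already been extracted into plain dicts.

def find_drift(staging: dict, production: dict) -> list:
    """Return (flag, staging_value, production_value) tuples that differ."""
    drift = []
    for flag in sorted(staging.keys() | production.keys()):
        s, p = staging.get(flag), production.get(flag)
        if s != p:
            drift.append((flag, s, p))
    return drift

staging_flags = {"--auto-compaction-retention": "1h", "--snapshot-count": "10000"}
production_flags = {"--auto-compaction-retention": "0", "--snapshot-count": "10000"}

for flag, s, p in find_drift(staging_flags, production_flags):
    print(f"DRIFT: {flag} staging={s} production={p}")
```

A check like this only proves the environments agree; it says nothing about whether the agreed value is safe, which is why it complements rather than replaces value validation.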
| Action | Status | Owner | Deadline |
|---|---|---|---|
| Correct production Ansible playbook | Completed | SRE | Dec 16 |
| Add compaction param to pre-flight validation | Completed | SRE | Dec 17 |
| etcd memory alerts at 60% threshold | In progress | Monitoring | Dec 30 |
| Canary deployment for etcd config changes | In progress | SRE | Jan 2 |
| Production-scale etcd test environment | Planned | Infra | Jan 31 |
| cgroup memory limits for etcd | Planned | SRE | Jan 31 |
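The pre-flight validation item above can be sketched as a check for known-dangerous flag values, run before any configuration push reaches production. The rule table here is illustrative, not our actual rule set:

```python
# Pre-flight check: reject etcd flag values known to be dangerous.
# The rules below are illustrative examples, not a complete policy.

DANGEROUS = {
    # flag: (predicate that returns True for a bad value, reason)
    "--auto-compaction-retention": (
        lambda v: v == "0",
        "0 triggered immediate full-keyspace compaction in our setup",
    ),
}

def validate_flags(flags: dict) -> list:
    """Return human-readable violations; an empty list means the config passes."""
    violations = []
    for flag, (is_bad, reason) in DANGEROUS.items():
        if flag in flags and is_bad(flags[flag]):
            violations.append(f"{flag}={flags[flag]}: {reason}")
    return violations

print(validate_flags({"--auto-compaction-retention": "0"}))   # one violation
print(validate_flags({"--auto-compaction-retention": "1h"}))  # []
```

Wiring a check like this into the deploy pipeline as a blocking step means a bad value fails the push instead of OOM-killing etcd at runtime.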
We take control plane availability seriously, and 23 minutes of downtime is not acceptable. This was a preventable configuration error, and the safeguards we're implementing should have existed before this incident. Affected customers will receive automatic SLA credits applied to their January invoice.