Incident · 3 min read

Post-Mortem: Kubernetes Control Plane Outage, December 16, 2025

A configuration change to our etcd backup system caused 23 minutes of control plane unavailability affecting 1,847 clusters. Running workloads were not affected. Here's what happened and what we're doing to prevent it.

Rémi Fournier · Site Reliability Engineer

Summary

On December 16, 2025, between 14:23 UTC and 14:46 UTC, the Kubernetes control plane was unavailable for managed clusters in our Virginia, Oregon, and Frankfurt regions. Customers could not create, modify, or delete Kubernetes resources during this window. Existing workloads continued running uninterrupted because the kubelet on worker nodes operates independently of the control plane for already-scheduled pods.

The root cause was a misconfigured etcd compaction parameter deployed as part of an unrelated backup system change. The parameter caused etcd to attempt compaction of the entire keyspace simultaneously rather than incrementally, exhausting available memory and causing OOM kills.

Impact summary

Duration: 23 minutes
Affected: 1,847 clusters
Regions: Virginia, Oregon, Frankfurt

Running pods: NOT affected
New deployments: BLOCKED
API calls: 503 errors

Timeline

Incident timeline — December 16, 2025, 14:12–14:46 UTC

14:12 UTC — A configuration change to the etcd backup cronjob is deployed to production. The change was intended to enable incremental snapshots, reducing backup duration from 45 minutes to approximately 8 minutes. The change had been tested in staging for two weeks without issues.

14:23 UTC — The backup cronjob runs for the first time with the new configuration. A parameter that was correctly set in staging (--auto-compaction-retention=1h) was overridden by a stale value in the production Ansible playbook (--auto-compaction-retention=0), which etcd interprets as "compact everything immediately." All three regions' etcd nodes simultaneously spike from 8GB to 42GB memory usage and are OOM-killed.

14:24 UTC — Monitoring alerts fire. On-call SRE acknowledges within 90 seconds. Auto-restart fails because the compaction parameter persists, causing immediate re-OOM on startup.

14:28 UTC — Root cause identified from etcd logs showing compaction: start at revision 0. Rollback configuration drafted.

14:33 UTC — Rollback deployed via emergency configuration push (bypasses normal CI/CD). etcd nodes begin restarting correctly.

14:38 UTC — Virginia and Oregon fully recovered. Frankfurt is delayed: one etcd node requires manual data directory cleanup after its WAL was corrupted by the mid-compaction OOM kill.

14:46 UTC — All regions recovered. Verified by automated health checks against 200 sample clusters.

Root cause analysis

The direct cause was a stale configuration value in our production Ansible playbook, copied from an older template during a refactor three months earlier. The staging environment used a different playbook that had been correctly updated. The deeper issue: our etcd configuration testing didn't validate against production-scale data volumes (38GB production vs 2GB staging). At 2GB, full compaction completes in under a second and never triggers an OOM.
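The pre-flight check we added (see the prevention table below) can be sketched roughly as follows. This is an illustrative outline, not our actual tooling: the function name and the single-rule scope are assumptions. The idea is simply to scan the rendered etcd arguments from a playbook and reject a zero retention value before anything reaches production.

```python
import re

# Hypothetical pre-flight rule: reject a zero auto-compaction retention
# value in rendered etcd flags. Name and scope are illustrative only.
FORBIDDEN = re.compile(r"--auto-compaction-retention=0(?:\s|$)")

def validate_etcd_args(rendered_args: str) -> list[str]:
    """Return a list of violations found in the rendered etcd flags."""
    violations = []
    for line in rendered_args.splitlines():
        stripped = line.strip()
        if FORBIDDEN.search(stripped):
            violations.append(f"zero compaction retention: {stripped}")
    return violations
```

With this in place, the staging value (--auto-compaction-retention=1h) passes, while the stale production value (--auto-compaction-retention=0) fails the deploy before it can run.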

Prevention measures

Action                                          Status        Owner        Deadline
Correct production Ansible playbook             Completed     SRE          Dec 16
Add compaction param to pre-flight validation   Completed     SRE          Dec 17
etcd memory alerts at 60% threshold             In progress   Monitoring   Dec 30
Canary deployment for etcd config changes       In progress   SRE          Jan 2
Production-scale etcd test environment          Planned       Infra        Jan 31
cgroup memory limits for etcd                   Planned       SRE          Jan 31
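To make the canary idea concrete, here is a minimal sketch of the gate logic, assuming a per-node memory limit of 48 GB and the new 60% alert threshold; both numbers and the function name are illustrative assumptions, not our production values. A config change rolls out to one region first, and the rollout continues only if post-change memory samples stay under the alert line.

```python
# Illustrative canary gate: limit and threshold are assumed values.
MEMORY_LIMIT_GB = 48.0   # assumed per-node memory limit
ALERT_FRACTION = 0.60    # matches the new 60% alert threshold

def canary_ok(samples_gb: list[float]) -> bool:
    """True if every post-change memory sample stays below the alert line."""
    return all(s < MEMORY_LIMIT_GB * ALERT_FRACTION for s in samples_gb)
```

A healthy canary holding steady around 8 GB passes the gate; a spike toward the 42 GB seen in this incident would halt the rollout in the first region instead of taking out all three at once.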

Apology

We take control plane availability seriously, and 23 minutes of downtime is not acceptable. This was a preventable configuration error, and the safeguards we're implementing should have existed before this incident. Affected customers will receive automatic SLA credits applied to their January invoice.