We replaced our entire DNS serving infrastructure over a single weekend with zero customer-visible downtime. Here's the engineering that made it possible.
Our previous DNS infrastructure was built on PowerDNS with a PostgreSQL backend, served from 8 anycast nodes. It worked, but it was showing its age. Query latency had crept up to 3-4ms average (from the 1ms we targeted), zone propagation took 15-30 seconds (we wanted under 5), and the PowerDNS configuration had accumulated enough technical debt that adding new record types required weeks of testing.
The new infrastructure is a custom DNS server written in Go that reads zone data from an in-memory store replicated via our internal gossip protocol. It serves from 16 anycast nodes and resolves queries in under 1ms at the 99th percentile.
For six weeks before the cutover weekend, every DNS query was handled by the old system while the new system received a shadow copy for validation.
// Shadow query validation loop
for query := range incomingQueries {
oldResponse := oldSystem.Resolve(query)
newResponse := newSystem.Resolve(query)
if !ResponsesEqual(oldResponse, newResponse) {
discrepancies.Increment()
log.Warn("mismatch",
"query", query.String(),
"old", oldResponse,
"new", newResponse)
}
}
During the first week, we found 847 discrepancies out of approximately 12 billion queries. Most were edge cases in CNAME chain resolution and wildcard record handling. We fixed each one, re-deployed, and continued shadow testing until we had 72 hours of zero discrepancies.
The actual migration happened on Saturday at 06:00 UTC. We used weighted DNS load balancing at the anycast layer: traffic was gradually shifted from old to new over 4 hours. At no point did any external monitoring service detect any resolution failures or latency spikes.
| Metric | Old System | New System | Improvement |
|---|---|---|---|
| p50 resolution latency | 2.1ms | 0.4ms | 5.3x faster |
| p99 resolution latency | 8.4ms | 0.9ms | 9.3x faster |
| Avg zone propagation | 18 seconds | 2.8 seconds | 6.4x faster |
| p99 zone propagation | 47 seconds | 4.1 seconds | 11.5x faster |
| Anycast nodes | 8 | 16 | 2x coverage |
| Consistency model | Eventual | 12/16 quorum | Stronger |
Changes are pushed simultaneously via gossip, and the API returns 200 only after 12 of 16 nodes confirm — strong consistency while tolerating up to 4 simultaneous node failures.
DNSSEC automation (auto-signing with key rotation) shipped two weeks post-migration. DNS-over-HTTPS and DNS-over-TLS expected Q1 2026. Real-time DNS analytics (query patterns, geographic distribution, response code breakdowns) in development.