
Migrating 2.1 Million DNS Zones to Our New Anycast Infrastructure

We replaced our entire DNS serving infrastructure over a single weekend with zero customer-visible downtime. Here's the engineering that made it possible.

Zhen-Wei Lin · Backend Engineer

Why we migrated

Our previous DNS infrastructure was built on PowerDNS with a PostgreSQL backend, served from 8 anycast nodes. It worked, but it was showing its age. Query latency had crept up to 3-4ms average (from the 1ms we targeted), zone propagation took 15-30 seconds (we wanted under 5), and the PowerDNS configuration had accumulated enough technical debt that adding new record types required weeks of testing.

The new infrastructure is a custom DNS server written in Go that reads zone data from an in-memory store replicated via our internal gossip protocol. It serves from 16 anycast nodes and resolves queries in under 1ms at the 99th percentile.

The dual-serving strategy

For six weeks before the cutover weekend, every DNS query was handled by the old system while the new system received a shadow copy for validation.

// Shadow query validation loop: the old system's answer is what the
// customer receives; the new system's answer is compared and logged.
for query := range incomingQueries {
    oldResponse := oldSystem.Resolve(query)
    newResponse := newSystem.Resolve(query)

    if !ResponsesEqual(oldResponse, newResponse) {
        // Count and log the mismatch; shadow traffic never affects
        // what is actually served.
        discrepancies.Increment()
        log.Warn("mismatch",
            "query", query.String(),
            "old", oldResponse,
            "new", newResponse)
    }
}

During the first week, we found 847 discrepancies out of approximately 12 billion queries. Most were edge cases in CNAME chain resolution and wildcard record handling. We fixed each one, re-deployed, and continued shadow testing until we had 72 hours of zero discrepancies.

The cutover

[Figure: Cutover strategy, Saturday 06:00–10:00 UTC. Traffic to the new system stepped up at 10%, 25%, 50% (held for one hour), 75%, 99%, and 100%. Old system: PowerDNS + PostgreSQL, 8 nodes, p50 2.1ms / p99 8.4ms. New system: custom Go + in-memory store, 16 nodes, p50 0.4ms / p99 0.9ms.]
Migration cutover: weighted traffic shift over 4 hours with zero failures

The actual migration happened on Saturday at 06:00 UTC. We used weighted DNS load balancing at the anycast layer, gradually shifting traffic from old to new over 4 hours, including a one-hour hold at 50% to watch the dashboards. No external monitoring service detected a resolution failure or latency spike at any point during the shift.

Zone replication performance

| Metric | Old System | New System | Improvement |
| --- | --- | --- | --- |
| p50 resolution latency | 2.1 ms | 0.4 ms | 5.3x faster |
| p99 resolution latency | 8.4 ms | 0.9 ms | 9.3x faster |
| Avg zone propagation | 18 seconds | 2.8 seconds | 6.4x faster |
| p99 zone propagation | 47 seconds | 4.1 seconds | 11.5x faster |
| Anycast nodes | 8 | 16 | 2x coverage |
| Consistency model | Eventual | 12/16 quorum | Stronger |

Changes are pushed simultaneously via gossip, and the API returns 200 only after 12 of 16 nodes confirm — strong consistency while tolerating up to 4 simultaneous node failures.

What's next

DNSSEC automation (auto-signing with key rotation) shipped two weeks after the migration. DNS-over-HTTPS and DNS-over-TLS are expected in Q1 2026, and real-time DNS analytics (query patterns, geographic distribution, response code breakdowns) is in development.