Blue/Green Cluster Upgrades in EKS with external-dns

JR

2 minute read

Streamline EKS blue/green upgrades by orchestrating node groups, external-dns synchronization, and DNS propagation checks to minimize downtime.

Workflow

  1. Prepare Green Cluster

    • Create a new EKS cluster (green) running the target Kubernetes version, using eksctl or the AWS Console.
    • Mirror the blue cluster's node groups, keeping IAM roles and security group configurations identical.
  2. Sync DNS Records

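    • Deploy external-dns to the green cluster with --provider=aws and a --txt-owner-id that differs from the blue cluster's, so the TXT registry keeps each instance from modifying records the other owns. Running with --policy=upsert-only (or --dry-run on green until cutover) further reduces the risk of conflicting changes.
    • Confirm green services have load balancer addresses with kubectl get services -o wide, then check the external-dns logs or the Route 53 hosted zone for the corresponding records.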
  3. Validate Services

    • Deploy test workloads to the green cluster and confirm their endpoints are reachable.
    • Use dig or the Route 53 console to verify that the records for the test hostnames resolve to the green cluster's load balancers.
  4. Cutover DNS

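    • Lower the record TTL to 60 seconds ahead of time, at least one full old-TTL period before cutover, so cached answers expire quickly once the records change.
    • Flip the DNS A/CNAME records to point to the green cluster's ingress load balancers.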
  5. Monitor and Rollback

    • Watch CloudWatch metrics (e.g., HTTP 5xx errors) and node health for 30 minutes post-cutover.
    • If issues arise, revert DNS to the blue cluster, then drain green cluster nodes one at a time using kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data.
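
The cutover and rollback steps above both come down to one record change. The sketch below builds the ChangeBatch structure that the Route 53 ChangeResourceRecordSets API accepts; the function name and the hostnames are hypothetical, and it assumes the record is a plain CNAME rather than an alias record.

```python
import json


def build_cutover_change_batch(record_name, target_hostname, ttl=60,
                               comment="blue/green cutover"):
    """Return a Route 53 ChangeBatch that points record_name at target_hostname.

    Hypothetical helper: it only builds the request body and makes no AWS calls.
    """
    if not record_name.endswith("."):
        record_name += "."  # Route 53 stores record names fully qualified
    return {
        "Comment": comment,
        "Changes": [
            {
                # UPSERT creates the record or overwrites the existing one
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_hostname}],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder names; rollback is the same call with the blue target.
    cutover = build_cutover_change_batch("app.example.com",
                                         "green-ingress.example.net")
    print(json.dumps(cutover, indent=2))
```

The resulting dictionary can be passed as ChangeBatch to boto3's route53 client method change_resource_record_sets together with the HostedZoneId. If external-dns owns the record, changing it by hand may be overwritten on the next sync, so in that setup the flip is usually done by moving the hostname annotation or Ingress to the green cluster instead. A CNAME also cannot sit at the zone apex; an alias A record is needed there.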

Policy Example

Node Group Upgrade Policy

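  • Enforce a consistent role label and eks.amazonaws.com/capacity-type: <value> (ON_DEMAND or SPOT, set automatically on managed node groups) on new node groups. The kubelet rejects most self-assigned labels under the kubernetes.io/ prefix, so a label such as kubernetes.io/role: node generally has to be applied after registration or replaced with a custom-prefixed key.
  • Require taints and tolerations to match between the blue and green clusters to prevent scheduling mismatches.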
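
A policy like this is easier to enforce when the comparison is automated. Here is a minimal sketch that diffs labels and taints between two node group descriptions; the dictionary layout, the function name, and the example.com/role label are assumptions for illustration, not an EKS or eksctl data format.

```python
def node_group_mismatches(blue, green):
    """Return human-readable differences in labels and taints.

    Each argument is a dict with:
      "labels": {key: value}
      "taints": iterable of (key, value, effect) tuples
    """
    problems = []

    # Compare every label key seen on either side.
    all_keys = set(blue["labels"]) | set(green["labels"])
    for key in sorted(all_keys):
        blue_value = blue["labels"].get(key)
        green_value = green["labels"].get(key)
        if blue_value != green_value:
            problems.append(
                f"label {key}: blue={blue_value!r} green={green_value!r}")

    # Taints must match exactly, otherwise tolerations behave differently.
    blue_taints = set(blue["taints"])
    green_taints = set(green["taints"])
    for taint in sorted(blue_taints - green_taints):
        problems.append(f"taint missing on green: {taint}")
    for taint in sorted(green_taints - blue_taints):
        problems.append(f"taint only on green: {taint}")

    return problems


if __name__ == "__main__":
    blue = {
        "labels": {"example.com/role": "node",
                   "eks.amazonaws.com/capacity-type": "ON_DEMAND"},
        "taints": [("dedicated", "batch", "NoSchedule")],
    }
    green = {
        "labels": {"example.com/role": "node",
                   "eks.amazonaws.com/capacity-type": "SPOT"},
        "taints": [],
    }
    for line in node_group_mismatches(blue, green):
        print(line)
```

An empty result means the two node groups agree on everything the policy covers. The inputs could be filled from kubectl get nodes -o json output, which is left out here.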

Tooling

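  • eksctl: Manage cluster and node group lifecycle (eksctl create nodegroup --cluster <cluster-name> --nodes 3).
  • external-dns: Sync services to Route 53 (external-dns --source=service --provider=aws --domain-filter=<domain> --txt-owner-id=<cluster-id>). Record TTL is set per resource with the external-dns.alpha.kubernetes.io/ttl annotation.
  • AWS Route 53: Check records in the console, or query an authoritative name server directly with dig @<route53-name-server> <domain> (the name servers are listed in the hosted zone's NS record).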
  • Prometheus/Grafana: Alert on service latency or error rate spikes during cutover.
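
Since both clusters run external-dns against the same hosted zone during the upgrade, it helps to generate the container arguments from one place so that only the owner ID differs. The sketch below does that with flags external-dns provides (--source, --provider, --domain-filter, --registry, --txt-owner-id, --policy, --dry-run); the function name, owner IDs, and domain are hypothetical.

```python
VALID_POLICIES = {"sync", "upsert-only", "create-only"}


def external_dns_args(owner_id, domain, policy="upsert-only", dry_run=False):
    """Build the external-dns container args for one cluster."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    args = [
        "--source=service",
        "--source=ingress",
        "--provider=aws",
        f"--domain-filter={domain}",
        "--registry=txt",                 # track ownership in TXT records
        f"--txt-owner-id={owner_id}",     # must be unique per cluster
        f"--policy={policy}",             # upsert-only never deletes records
    ]
    if dry_run:
        args.append("--dry-run")          # log planned changes, apply nothing
    return args


if __name__ == "__main__":
    blue_args = external_dns_args("eks-blue", "example.com")
    green_args = external_dns_args("eks-green", "example.com", dry_run=True)
    print("\n".join(green_args))
```

Running green with --dry-run first shows which records it would create without touching the zone; dropping the flag at cutover lets it start writing.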

Tradeoffs

  • Resource Overhead: Green cluster requires ~2x node resources temporarily.
  • DNS Propagation: Even with a low TTL, resolvers and clients that cache beyond the TTL can keep sending traffic to the blue cluster for several minutes after cutover.
  • Sync Conflicts: Misconfigured external-dns RBAC or duplicate DNS entries can cause outages.
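
The DNS propagation tradeoff is mostly TTL arithmetic. As a rough sketch, assuming resolvers honor TTLs: a lowered TTL only takes effect once answers cached under the old TTL have expired, and after the flip a compliant resolver can keep serving the blue answer for at most the new TTL. The function names and timestamps below are illustrative.

```python
from datetime import datetime, timedelta, timezone


def earliest_cutover(ttl_lowered_at, old_ttl_seconds):
    """Earliest time by which every TTL-honoring cache holds the low-TTL record."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def latest_stale_answer(cutover_at, new_ttl_seconds):
    """Latest time a TTL-honoring resolver may still return the blue answer."""
    return cutover_at + timedelta(seconds=new_ttl_seconds)


if __name__ == "__main__":
    # Example: TTL lowered from 3600 s to 60 s at 12:00 UTC.
    lowered_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    cutover_at = earliest_cutover(lowered_at, old_ttl_seconds=3600)
    stale_until = latest_stale_answer(cutover_at, new_ttl_seconds=60)
    print("earliest cutover:", cutover_at.isoformat())
    print("blue answers may persist until:", stale_until.isoformat())
```

Clients that ignore TTLs, such as long-lived connection pools or runtimes that cache lookups indefinitely, fall outside this bound, which is why the blue cluster should stay up for a while after cutover.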

Troubleshooting

  • DNS Not Updating:

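    • Check external-dns logs for AWS API errors (kubectl logs -n kube-system deployment/external-dns; adjust the namespace to wherever external-dns is installed).
    • Verify the IAM policy permissions for Route 53: external-dns requires route53:ChangeResourceRecordSets on the hosted zone, plus route53:ListHostedZones and route53:ListResourceRecordSets.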
  • Node Registration Failures:

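    • The EKS control plane is managed, so cloud-controller-manager logs are not reachable with kubectl logs; inspect the kubelet logs on the instance instead (journalctl -u kubelet).
    • Confirm the node instance role carries the AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly managed policies, and that the role is mapped in the green cluster's aws-auth ConfigMap or an EKS access entry.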
  • Service Endpoints Stale:

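    • Endpoints are reconciled automatically, so check the inputs: compare kubectl get endpoints <service-name> with kubectl get pods -l <selector> -o wide, and confirm that the service selector matches the pod labels and that the pods are Ready.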
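
For the "DNS Not Updating" case, a quick offline check of the IAM policy document can rule out missing permissions before digging through logs. This is a simplified sketch: the function name is hypothetical, and it only inspects Allow statements, ignoring Deny statements, Resource scoping, and conditions.

```python
from fnmatch import fnmatchcase

REQUIRED_ACTIONS = (
    "route53:ChangeResourceRecordSets",
    "route53:ListHostedZones",
    "route53:ListResourceRecordSets",
)


def missing_route53_actions(policy_document):
    """Return the required Route 53 actions that no Allow statement grants."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):      # a single statement may be a dict
        statements = [statements]

    allowed_patterns = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):      # Action may be a string or a list
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lowercase.
        allowed_patterns.extend(action.lower() for action in actions)

    return [
        required for required in REQUIRED_ACTIONS
        if not any(fnmatchcase(required.lower(), pattern)
                   for pattern in allowed_patterns)
    ]


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": "route53:ChangeResourceRecordSets",
             "Resource": "arn:aws:route53:::hostedzone/*"},
        ],
    }
    print(missing_route53_actions(policy))
```

An empty list means the policy at least names the actions external-dns needs; it does not prove the role is actually attached to the external-dns service account.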

Avoid overcomplicating the upgrade with canary deployments unless you need granular traffic shifting; blue/green is simpler for most use cases. Always test upgrades in staging with production-like workloads first.

Source thread: Any tips on blue/green cluster upgrades in EKS while using external-dns?
