Evaluating 2-node Kubernetes Clusters with Remote Etcd


JR

A 2-node Kubernetes cluster with remote etcd can work but introduces risks like network dependency and etcd performance bottlenecks that must be mitigated through careful design and monitoring.

Diagnosis: Why This Setup Risks Stability

A 2-node cluster with remote etcd creates a single point of failure for control plane availability. Etcd, as the source of truth, must be highly available, but a remote deployment introduces:

  • Network dependency: Control plane nodes depend on stable, low-latency connectivity to the remote etcd; any disruption stalls API operations.
  • Etcd performance bottlenecks: Remote etcd clusters may suffer from increased latency or throughput limits.
  • Quorum fragility: A 2-node etcd cluster (common in remote setups) loses write capability if one node fails.
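The quorum arithmetic behind that last bullet is worth making concrete: etcd accepts writes only while floor(n/2)+1 members are reachable, so a 2-member cluster tolerates zero failures. A minimal sketch (the helper names are illustrative, not etcd tooling):

```shell
#!/bin/sh
# etcd needs a quorum of floor(n/2)+1 members to accept writes, so the
# fault tolerance of an n-member cluster is n minus that quorum.
quorum()          { echo $(( $1 / 2 + 1 )); }
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 5; do
  printf 'members=%d quorum=%d tolerates=%d failure(s)\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
# members=2 quorum=2 tolerates=0 failure(s)  <- two members add risk, not safety
# members=3 quorum=2 tolerates=1 failure(s)
```

This is why adding a second etcd member without a third actually lowers availability: either member failing halts writes.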

Common Failure Points

  • Network partitions between nodes and etcd causing control plane hangs.
  • Etcd disk latency spikes leading to API server timeouts.
  • Node pressure (CPU/memory) on 2-node clusters exacerbating control plane instability.

Repair Steps: Stabilizing an Existing Setup

  1. Audit etcd health:

    etcdctl --endpoints=https://etcd-server:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
      member list --write-out=table
    

    Ensure all etcd members are reachable and healthy.

  2. Check network latency:

    ping -c 10 etcd-server  
    traceroute etcd-server  
    

    Latency >50ms or packet loss >1% warrants network team involvement.

  3. Profile node resource usage:

    kubectl top nodes  
    

    If nodes are CPU/memory constrained, scale up or distribute workloads.
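The 50ms rule from step 2 can be scripted rather than eyeballed across sites. A sketch assuming the Linux iputils ping summary format; `avg_rtt_ms` and the sample line are illustrative:

```shell
#!/bin/sh
# Pull the average RTT (in ms) out of the ping summary line, e.g.
#   rtt min/avg/max/mdev = 41.2/48.7/60.3/5.1 ms   (iputils / Linux)
# BSD ping prints "round-trip min/avg/max/stddev" instead, hence the alternation.
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Hypothetical summary line standing in for live ping output:
summary='rtt min/avg/max/mdev = 41.2/48.7/60.3/5.1 ms'
avg=$(printf '%s\n' "$summary" | avg_rtt_ms)     # 48.7

# Compare against the 50ms threshold (awk handles the float comparison):
if awk -v a="$avg" 'BEGIN { exit !(a > 50) }'; then
  echo "avg ${avg}ms > 50ms: involve the network team"
else
  echo "avg ${avg}ms within budget"
fi
```

Run against each site's etcd endpoint, this turns the threshold into a pass/fail check that can live in a cron job or CI gate.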

Prevention: Designing for Resilience

Policy Example: Cluster Node and Etcd Requirements

1. Minimum 3 nodes for Kubernetes control plane.  
2. Etcd cluster co-located with control plane unless:  
   - Dedicated etcd cluster with 3+ nodes.  
   - Network SLA guarantees <20ms latency and 99.9% uptime.  
3. Monitoring alerts for:  
   - Etcd leader changes (>5/min).  
   - API server 5xx errors.  
   - Node network interruptions.  
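The leader-change alert above is normally a Prometheus rule, but it can be prototyped directly against etcd's /metrics endpoint. A sketch parsing the Prometheus text format; the sample lines are hypothetical, though `etcd_server_leader_changes_seen_total` is a real etcd counter:

```shell
#!/bin/sh
# etcd exposes counters in Prometheus text exposition format on /metrics, e.g.
#   etcd_server_leader_changes_seen_total 7
# leader_changes extracts that counter; sampling it twice a minute apart
# gives the changes-per-minute rate used in the policy.
leader_changes() {
  awk '$1 == "etcd_server_leader_changes_seen_total" { print $2 }'
}

# Hypothetical /metrics excerpt standing in for a live scrape:
sample='# HELP etcd_server_leader_changes_seen_total The number of leader changes seen.
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 7'

printf '%s\n' "$sample" | leader_changes    # prints 7
```

A rising value here almost always correlates with the network issues diagnosed earlier; frequent elections mean heartbeats are being lost.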

Tooling for Visibility

  • etcdctl: For direct etcd member and health checks.
  • Prometheus + Grafana: Monitor etcd metrics (etcd_server_has_leader, etcd_server_leader_changes_seen_total, etcd_disk_wal_fsync_duration_seconds).
  • Network tools: mtr, tcpdump, or cloud-specific network analysis for latency/root cause.

Tradeoffs and Caveats

  • Cost vs. reliability: Adding nodes or dedicated etcd clusters increases cost but reduces blast radius.
  • Complexity: Remote etcd simplifies node scaling but adds operational overhead for network and etcd tuning.
  • Assumption: This guidance assumes on-prem or cloud with SLA-backed networking. Multi-cloud or hybrid setups require stricter validation.

Troubleshooting Checklist

  • Cluster hangs: Check etcd logs for leader elections or disk compaction warnings.
  • API server 5xx errors: Verify etcd latency metrics and node resource usage.
  • Node flapping: Investigate kernel panics or network MTU mismatches.
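For the "cluster hangs" case above, leader churn is usually visible directly in the etcd logs. A sketch with hypothetical log lines; the exact message wording varies between etcd versions, so treat the patterns as a starting point:

```shell
#!/bin/sh
# Count leader-election events in etcd log output. Message text differs
# across etcd versions ("elected leader", "leader changed"), so match both.
leader_elections() {
  grep -cE 'elected leader|leader changed'
}

# Hypothetical log excerpt standing in for real etcd logs:
logs='raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2
applied snapshot at index 10000
raft.node: 8e9e05c52164694d elected leader aa1f67bb3d5e2c11 at term 3'

printf '%s\n' "$logs" | leader_elections    # prints 2
```

More than a handful of elections in a short window points at network or disk latency rather than a one-off restart.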

If you’re already running a 2-node setup with remote etcd, prioritize monitoring and have a rollback plan for node or network changes. For greenfield deployments, default to 3 nodes and co-located etcd unless you have a compelling reason to separate them—and the operational bandwidth to manage it.

Source thread: 2-node sites + remote etcd — am I building a time bomb?
