Evaluating 2-node Kubernetes Clusters with Remote Etcd
A 2-node Kubernetes cluster with remote etcd can work but introduces risks like network dependency and etcd performance bottlenecks that must be mitigated through careful design and monitoring.
Diagnosis: Why This Setup Risks Stability
A 2-node cluster with remote etcd creates a single point of failure for control plane availability. Etcd, as the source of truth, must be highly available, but a remote deployment introduces:
- Network dependency: Cluster masters depend on stable, low-latency connectivity to etcd.
- Etcd performance bottlenecks: Remote etcd clusters may suffer from increased latency or throughput limits.
- Quorum fragility: A 2-node etcd cluster (common in remote setups) loses write capability if one node fails.
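The quorum point is worth making concrete: etcd commits a write only when a majority (floor(n/2) + 1) of members acknowledge it, so a 2-member cluster tolerates zero failures, the same as a single member. A minimal shell sketch of the arithmetic:

```shell
# etcd needs a majority (floor(n/2) + 1) of members up to commit
# writes, so an n-member cluster tolerates n - quorum failures.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 5; do
  echo "members=$n quorum=$(quorum $n) tolerated_failures=$(tolerated $n)"
done
```

Note that going from 1 member to 2 raises the quorum without buying any fault tolerance; only the jump to 3 does.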
Common Failure Points
- Network partitions between nodes and etcd causing control plane hangs.
- Etcd disk latency spikes leading to API server timeouts.
- Node pressure (CPU/memory) on 2-node clusters exacerbating control plane instability.
Repair Steps: Stabilizing an Existing Setup
- Audit etcd health:

  ```shell
  etcdctl --endpoints=https://etcd-server:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list --write-out=table
  ```

  Ensure all etcd members are reachable and healthy.
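To make "reachable and healthy" scriptable, you can count the members reporting started in the table output. A sketch parsing a saved copy of that table (the member IDs and addresses below are made up, standing in for a live etcdctl run):

```shell
# Count members whose STATUS column reads "started" in saved
# `etcdctl member list --write-out=table` output.
member_table='+------------------+---------+--------+-----------------------+-----------------------+
|        ID        | STATUS  |  NAME  |      PEER ADDRS       |     CLIENT ADDRS      |
+------------------+---------+--------+-----------------------+-----------------------+
| 8e9e05c52164694d | started | etcd-1 | https://10.0.0.1:2380 | https://10.0.0.1:2379 |
| 91bc3c398fb3c146 | started | etcd-2 | https://10.0.0.2:2380 | https://10.0.0.2:2379 |
+------------------+---------+--------+-----------------------+-----------------------+'

started=$(echo "$member_table" | grep -c '| started |')
echo "started members: $started"
```

Comparing that count against the expected member count gives you a cheap health gate for CI or cron.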
- Check network latency:

  ```shell
  ping -c 10 etcd-server
  traceroute etcd-server
  ```

  Latency >50ms or packet loss >1% warrants network team involvement.
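The 50ms threshold is easy to automate. A sketch that extracts the average RTT from ping's summary line (GNU ping's `rtt min/avg/max/mdev` format assumed; the sample line stands in for a live run):

```shell
# Pull the average RTT out of ping's summary line and compare it to a
# 50 ms budget. Replace the sample with: ping -c 10 etcd-server | tail -1
summary='rtt min/avg/max/mdev = 12.310/48.771/103.224/30.541 ms'

# Splitting on "/" puts the average in the fifth field.
avg=$(echo "$summary" | awk -F'/' '{print $5}')
over=$(awk -v avg="$avg" 'BEGIN { print (avg + 0 > 50) ? "yes" : "no" }')

if [ "$over" = "yes" ]; then
  echo "WARN: avg RTT ${avg} ms exceeds 50 ms budget"
else
  echo "OK: avg RTT ${avg} ms within budget"
fi
```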
- Profile node resource usage:

  ```shell
  kubectl top nodes
  ```

  If nodes are CPU/memory constrained, scale up or distribute workloads.
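One way to act on this is to flag any node above a utilization threshold. The 80% cutoff here is an illustrative assumption, and the sample table stands in for live `kubectl top nodes` output:

```shell
# Flag nodes whose CPU% or MEMORY% exceeds 80, the point where
# control-plane components on a small cluster start getting starved.
top_output='NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   3800m        95%    14000Mi         87%
node-b   1200m        30%    6000Mi          37%'

# Strip the % signs, then compare columns 3 (CPU%) and 5 (MEMORY%).
hot_nodes=$(echo "$top_output" | awk 'NR > 1 { gsub(/%/, ""); if ($3 > 80 || $5 > 80) print $1 }')
echo "Nodes over 80%: ${hot_nodes:-none}"
```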
Prevention: Designing for Resilience
Policy Example: Cluster Node and Etcd Requirements
1. Minimum 3 nodes for Kubernetes control plane.
2. Etcd cluster co-located with control plane unless:
- Dedicated etcd cluster with 3+ nodes.
- Network SLA guarantees <20ms latency and 99.9% uptime.
3. Monitoring alerts for:
- Etcd leader changes (>5/min).
- API server 5xx errors.
- Node network interruptions.
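As one concrete example, the leader-change alert could be expressed as a Prometheus alerting rule. This is a sketch: `etcd_server_leader_changes_seen_total` is the counter etcd exports for leader changes, the threshold mirrors the policy above, and the group name, `for:` window, and labels are illustrative choices to adapt to your setup.

```yaml
groups:
  - name: etcd-stability
    rules:
      - alert: EtcdFrequentLeaderChanges
        # More than 5 leader changes per minute suggests an unstable
        # network path to etcd or overloaded etcd members.
        expr: increase(etcd_server_leader_changes_seen_total[1m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "etcd is re-electing leaders frequently"
```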
Tooling for Visibility
- `etcdctl`: For direct etcd member and health checks.
- Prometheus + Grafana: Monitor etcd metrics (`etcd_server_has_leader`, `etcd_disk_want_compact_total`).
- Network tools: `mtr`, `tcpdump`, or cloud-specific network analysis for latency and root-cause diagnosis.
Tradeoffs and Caveats
- Cost vs. reliability: Adding nodes or dedicated etcd clusters increases cost but reduces blast radius.
- Complexity: Remote etcd simplifies node scaling but adds operational overhead for network and etcd tuning.
- Assumption: This guidance assumes on-prem or cloud with SLA-backed networking. Multi-cloud or hybrid setups require stricter validation.
Troubleshooting Checklist
- Cluster hangs: Check etcd logs for leader elections or disk compaction warnings.
- API server 5xx errors: Verify etcd latency metrics and node resource usage.
- Node flapping: Investigate kernel panics or network MTU mismatches.
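For the MTU-mismatch case, interface MTUs can be listed by parsing `ip -o link show`; the sample output below stands in for a live run:

```shell
# List interface name and MTU (skipping loopback); mismatched MTUs on
# the path to etcd show up as fragmentation or silent drops.
link_output='1: lo: <LOOPBACK,UP> mtu 65536 qdisc noqueue
2: eth0: <BROADCAST,UP> mtu 1500 qdisc mq
3: eth1: <BROADCAST,UP> mtu 9000 qdisc mq'

mtus=$(echo "$link_output" | awk '$2 != "lo:" { gsub(/:$/, "", $2); print $2, $5 }')
echo "$mtus"
```

Run the same check on every node and the etcd hosts; any disagreement along the path is a candidate root cause for flapping.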
If you’re already running a 2-node setup with remote etcd, prioritize monitoring and have a rollback plan for node or network changes. For greenfield deployments, default to 3 nodes and co-located etcd unless you have a compelling reason to separate them—and the operational bandwidth to manage it.
Source thread: 2-node sites + remote etcd — am I building a time bomb?
