Diagnosing and Fixing Kubernetes Cluster Drift in Production


JR


Cluster drift undermines reliability; here’s how to detect, correct, and prevent configuration inconsistencies in Kubernetes environments.

Diagnosis

Cluster drift manifests as unexpected behavior, failed rollouts, or security gaps caused by configuration mismatches between declared state (e.g., Helm charts, GitOps repos) and live cluster state.

Common symptoms:

  • Pods failing with ImagePullBackOff despite valid manifests
  • Nodes reporting NetworkPluginNotReady or other CNI initialization errors
  • Unexplained changes to RBAC roles or network policies

Diagnostic workflow:

  1. Audit logs:
    kubectl get events --sort-by=.metadata.creationTimestamp --field-selector involvedObject.kind=Pod  
    

    Look for repeated restarts or admission webhook denials.

  2. Compare live state:
    kubectl get deploy <name> -o yaml > live-deploy.yaml
    diff <(git show main:deploy.yaml) live-deploy.yaml  
    
  3. Check node conditions:
    kubectl describe nodes | grep -E 'Taints|Pressure|Network'  
    
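The diff in step 2 is noisy because the live object carries server-populated fields (uid, resourceVersion, status). A small normalization helper makes the comparison meaningful; a sketch, assuming a single-document YAML with a top-level status: block:

```shell
# Sketch: strip server-populated fields so live state diffs cleanly against Git.
# Assumes a single-document YAML with a top-level "status:" block.
normalize() {
  sed '/^status:/,$d' \
    | grep -vE '^[[:space:]]*(uid|resourceVersion|creationTimestamp|generation):'
}

# Intended usage (requires cluster access):
#   kubectl get deploy <name> -o yaml | normalize > live.yaml
#   git show main:deploy.yaml | normalize > declared.yaml
#   diff declared.yaml live.yaml
```

Tools like kubectl-neat do a more thorough job of this; the helper above is only the minimal idea.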

Repair Steps

Immediate remediation:

  1. Roll back known-good state:
    kubectl rollout undo deploy/<name> --to-revision=<revision>
    
  2. Force reapply manifests (--force deletes and re-creates objects that cannot be patched in place, so use it deliberately):
    kubectl apply --force -f ./manifests/
    
  3. Fix node issues:
    • Drain and reboot nodes with kubectl drain --ignore-daemonsets --delete-emptydir-data <node>
    • Reconcile CNI plugins if multiple network providers are active
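The drain/reboot/uncordon cycle in step 3 can be wrapped in a guarded sequence; a sketch, with the kubectl binary parameterized (KUBECTL) so the flow can be rehearsed against a stub before touching a real cluster:

```shell
# Sketch: drain a node, hand off for an out-of-band reboot, then uncordon.
# KUBECTL is parameterized so the sequence can be dry-run with a stub command.
KUBECTL="${KUBECTL:-kubectl}"

recycle_node() {
  node="$1"
  "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  # ...reboot the node out of band here (SSH, cloud API), then bring it back:
  "$KUBECTL" uncordon "$node"
}
```

Bailing out if the drain fails matters: uncordoning a node that still hosts undrainable pods hides the original problem.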

Post-remediation validation:

kubectl get pods -A -o wide | grep -v 'Running'  
kubectl get csr | grep -v Approved
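Note that filtering pods with grep -v 'Running' also lists Completed job pods as if they were failures; an awk filter on the STATUS column avoids that (a sketch, assuming the default kubectl column layout):

```shell
# Sketch: keep the header, drop pods whose STATUS column (field 4 with -A)
# is Running or Completed, leaving only genuinely unhealthy pods.
unhealthy_pods() {
  awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")'
}

# Intended usage (requires cluster access):
#   kubectl get pods -A -o wide | unhealthy_pods
```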

Prevention

Policy-as-code enforcement:
Use OPA Gatekeeper or Kyverno to enforce declarative policies. Example Kyverno policy to block privileged containers:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-privileged-containers
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-privileged
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): false
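With a policy like the one above in Enforce mode, a pod that requests privileged mode is rejected at admission time; for example (pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: debug-shell          # illustrative name
spec:
  containers:
    - name: shell
      image: busybox:1.36
      securityContext:
        privileged: true     # violates check-privileged; admission is denied
```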

GitOps hygiene:

  • Enable automated sync with self-heal on Argo CD Applications so out-of-band changes are reconciled continuously:
    kubectl -n argocd patch application <name> --type merge -p '{"spec":{"syncPolicy":{"automated":{"prune":true,"selfHeal":true}}}}'
    
  • Validate manifests pre-commit by running kustomize build (with -o <dir> to inspect the rendered output)
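Argo CD's automated sync with selfHeal reverts manual drift continuously and can be declared on the Application itself; a sketch, where name, path, and repo URL are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web                  # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/manifests.git   # placeholder
    targetRevision: main
    path: apps/web           # placeholder
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated:
      prune: true            # remove resources deleted from Git
      selfHeal: true         # revert out-of-band changes to live objects
```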

Tooling

Essential tools:

  • Audit: kubeaudit for dry-run policy checks
  • Monitoring: Prometheus + Thanos, with an alert such as KubernetesClusterDrift firing on up{job="kube-state-metrics"} == 0 (this catches a dead exporter, without which drift goes unobserved)
  • Runtime: Falco for detecting unexpected container behavior
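An up == 0 alert only catches a dead exporter. If Argo CD is deployed, its argocd_app_info metric carries a sync_status label that flags drifted applications directly; a sketch of a PrometheusRule, with illustrative names:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-drift-alerts   # illustrative
spec:
  groups:
    - name: drift
      rules:
        - alert: ArgoAppOutOfSync
          expr: argocd_app_info{sync_status="OutOfSync"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Application {{ $labels.name }} has drifted from Git for 15m"
```

The for: 15m hold-off avoids paging on transient drift that automated sync resolves on its own.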

Command-line shortcuts:

# Find ConfigMaps that were never applied declaratively (no last-applied annotation)
kubectl get cm -A -o json | jq -r '.items[] | select(.metadata.annotations["kubectl.kubernetes.io/last-applied-configuration"] == null) | "\(.metadata.namespace)/\(.metadata.name)"'

Tradeoffs

Strict policy enforcement increases deployment friction but reduces blast radius. Example: Blocking all latest image tags forces explicit versioning but requires build pipeline maturity. Balance via phased rollouts:

# Gradual policy adoption: exempt a namespace from Gatekeeper's admission webhook
# (the namespace must also be flagged as exemptable in Gatekeeper's configuration)
kubectl label namespace <namespace> admission.gatekeeper.sh/ignore=true

Troubleshooting

Common failure points:

  • RBAC gaps: kubectl auth can-i checks can miss permissions granted through aggregated ClusterRoles. Cross-check the bindings directly:
    kubectl get clusterrolebindings -o wide | grep <user/serviceaccount>
    
  • Image pull errors: Verify registry TLS certs with openssl s_client -connect <registry>:443
  • Node taints: Mismatched tolerations cause scheduling failures. Debug with:
    kubectl describe node <node> | grep -A 3 'Taints'  
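For reference, a taint and its matching toleration look like this (key and value are illustrative):

```yaml
# Taint applied to the node:
#   kubectl taint nodes <node> dedicated=gpu:NoSchedule
# Matching toleration required in the pod spec:
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule
```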
    

Cluster drift is a tax on complexity—minimize it by reducing manual interventions and automating state reconciliation.

