Start Chaos Experiments with Known Weaknesses, Not Random Targets

Begin chaos experiments by targeting services with known fragility or recent changes, not random workloads.

JR

2 minute read

Why This Matters

Kubernetes makes failure injection easy, but chaos without focus wastes time and risks overloading stable systems. Targeting known weaknesses first surfaces actionable issues, builds team confidence, and avoids “chaos for chaos' sake.”

Actionable Workflow

  1. Identify candidates:

    • Review monitoring/alerting for recurring errors (e.g., kubectl get events -A --field-selector type=Warning to surface warning events cluster-wide).
    • List services with recent deployments (kubectl get deployments -o wide --sort-by='{.metadata.creationTimestamp}').
    • Flag dependencies (databases, caches, external APIs) that lack redundancy.
  2. Prioritize:

    • Rank by blast radius (e.g., a payment service > a background job processor).
    • Focus on services with incomplete observability or undocumented failover.
  3. Validate:

    • Check if the target has existing chaos tests (the CRD name depends on your tool, e.g., kubectl get podchaos -A for Chaos Mesh or kubectl get chaosexperiments -A for Litmus).
    • Confirm exclusion policies (e.g., kubectl describe namespace kube-system | grep exclusion).
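The prioritization step above can be sketched as a scoring heuristic. A minimal sketch, where the tiers, weights, and candidate services are all hypothetical placeholders, not recommendations:

```python
# Hypothetical blast-radius ranking for chaos-experiment candidates.
# Tier weights, field names, and services are illustrative only.

CRITICALITY = {"user-facing": 3, "internal-api": 2, "background-job": 1}

def blast_radius_score(service):
    """Higher score = riskier to users and more likely to be fragile,
    so experiment on it first, with tighter constraints."""
    score = CRITICALITY.get(service["tier"], 1)
    if service.get("recent_deploy"):
        score += 2  # recent changes are the most likely source of fragility
    if not service.get("observability_complete", True):
        score += 1  # incomplete observability makes failures harder to debug
    return score

def rank_candidates(services):
    return sorted(services, key=blast_radius_score, reverse=True)

candidates = [
    {"name": "payment-service", "tier": "user-facing", "recent_deploy": True},
    {"name": "report-worker", "tier": "background-job"},
    {"name": "inventory-api", "tier": "internal-api",
     "observability_complete": False},
]

for svc in rank_candidates(candidates):
    print(svc["name"], blast_radius_score(svc))
```

The exact weights matter less than having an explicit, reviewable ranking: it turns "which service do we break first?" from a debate into a diff.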

Policy Example

# Exclusion policy snippet (applied via namespace labels)
metadata:
  labels:
    chaos.excluded: "true"
    chaos.exclusion_reason: "third_party_app_no_control"

Rules:

  • Default allow chaos in all non-excluded namespaces.
  • Exclusions require SRE + application team sign-off, documented in a shared registry.
  • No exclusions for “fear of breaking things” without a remediation plan.
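The default-allow rule above reduces to a single label check. A minimal sketch, assuming a controller that sees namespaces as label dictionaries (the namespace names and mocked objects below are illustrative, not from a live cluster):

```python
# Default-allow filter: a namespace is a valid chaos target unless it
# carries the chaos.excluded="true" label. Namespaces are mocked as
# plain dicts; a real controller would list them via the Kubernetes API.

def is_chaos_target(namespace):
    labels = namespace.get("labels", {})
    return labels.get("chaos.excluded") != "true"

namespaces = [
    {"name": "payments", "labels": {}},
    {"name": "third-party", "labels": {
        "chaos.excluded": "true",
        "chaos.exclusion_reason": "third_party_app_no_control"}},
    {"name": "kube-system", "labels": {"chaos.excluded": "true"}},
]

targets = [ns["name"] for ns in namespaces if is_chaos_target(ns)]
print(targets)  # only non-excluded namespaces remain
```

Keeping the filter this simple is the point: exclusions live in cluster state (labels), not in the chaos tool's config, so the sign-off registry stays the single source of truth.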

Tooling

  • Chaos Mesh: Use the action and selector fields of a PodChaos resource to target specific workloads.
    Example:
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: payment-service-pod-kill
    spec:
      action: pod-kill   # valid PodChaos actions include pod-kill, pod-failure, container-kill
      mode: one
      selector:
        labelSelectors:
          app: payment-service
  • Litmus: Target workloads via the ChaosEngine's appinfo fields (appns, applabel, appkind) to constrain targets.
  • Monitoring integration: Pair with Prometheus/Grafana to auto-detect anomalies post-injection.
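The monitoring integration can start as a simple before/after comparison of error rates. A toy sketch, assuming you can fetch the two rates (in practice from a Prometheus range query); the threshold and sample values are illustrative:

```python
# Toy post-injection anomaly check: flag the experiment if the error
# rate grew by more than an allowed factor. Values are illustrative;
# real numbers would come from Prometheus before and after injection.

def anomalous(pre_error_rate, post_error_rate, max_growth=2.0):
    """Return True if the post-injection error rate exceeds the budget."""
    if pre_error_rate == 0:
        # Any errors appearing from a clean baseline count as an anomaly.
        return post_error_rate > 0
    return post_error_rate / pre_error_rate > max_growth

print(anomalous(0.01, 0.015))  # 1.5x growth, within budget -> False
print(anomalous(0.01, 0.05))   # 5x growth, flag it -> True
```

A check like this can gate automatic experiment abort: if the function returns True, the chaos operator halts the injection and restores the workload.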

Tradeoffs

  • Over-isolation: Excluding too many components (e.g., all stateful services) reduces chaos effectiveness.
  • False confidence: Fixing only targeted issues might miss systemic problems (e.g., network partitions).
  • Operational overhead: Maintaining exclusion policies requires ongoing collaboration between SRE and app teams.

Troubleshooting

  • No targets found:
    • Check if monitoring is underconfigured (e.g., no alerts for deployment failures).
    • Ensure recent deployments are labeled correctly.
  • Experiments not triggering:
    • Verify chaos operator permissions (kubectl auth can-i create chaosexperiments --all-namespaces).
    • Check for conflicting exclusion labels.
  • Unexpected downtime:
    • Review blast radius constraints (e.g., kubectl get chaosresults -A).
    • Audit exclusion requests for abuse (e.g., app teams blocking tests without cause).

Final Note

Chaos experiments should evolve from debugging known issues to validating systemic resilience. Start small, iterate, and formalize exclusions as part of platform governance—not ad-hoc exceptions.

Source thread: For teams running chaos experiments on Kubernetes, how do you pick the first target?
