Common OpenShift Pitfalls

So many moving parts... So many opportunities for failure.

JR Morgan

4 minute read

I recently had a customer inquire about the most common issues I’ve run into with OpenShift deployments, operations, and upgrades. This was a great question and justified a writeup!

Overview

With so many moving parts, OpenShift can be difficult to monitor and manage. Below are some, but certainly not all, of the common pitfalls associated with deploying & administering the platform at multiple layers.

Cluster-level Pitfalls

  • Certificates

    • Certs are used for secure communication and authentication across multiple OpenShift components: etcd, nodes, masters, service serving certs, tenant routes, etc. To make deployment easy, openshift-ansible configures a local CA on your first master, which is then used to issue certs for the other master, node, and etcd hosts. By default, each issued certificate is valid for 2 years, while the CA certificate is valid for 5 years. Validity periods are configurable, but the certs for each service should still be monitored to ensure they're not nearing expiration (a quick expiry-check sketch follows this list). Playbooks are provided in atomic-openshift-utils to facilitate easy certificate renewal.
  • Missing or Misconfigured Quotas & Limits

    • Not setting explicit project quotas, limitRanges, or default request sizing can negatively impact cluster scheduling and performance. To be proactive, monitor per-node request and limit allocations to ensure you have adequate capacity and overcommit ratios for memory & CPU. If you plan to limit, or prohibit, tenant access to the OpenShift dashboards, request and limit sizing might instead be built into catalog requests (e.g. per deploymentConfig, buildConfig, etc.) rather than enforced at the project level. A LimitRange sketch follows this list.
  • etcd

    • OpenShift uses etcd to store system configuration and state for all resource objects in the cluster. It's your source of truth and should be monitored closely. Each etcd server exposes local monitoring information on its client port through http(s) endpoints, and that data is useful for both health checking and cluster debugging. If you're already pursuing Prometheus for cluster metrics ingestion, it's strongly recommended to scrape the etcd cluster endpoints as well; this can provide insight into DB size, peer traffic, and client traffic (a minimal scrape sketch follows this list).
  • ElasticSearch & fluentd

    • Since stdout from running containers is lost upon pod termination, it's essential that your ElasticSearch pod(s) and fluentd daemons keep running to collect historical logs. Size your ES storage adequately so that log curation and recycling, set to 14 days by default, can responsibly prune old entries from the indices. At a higher level, monitor the logging pods, particularly fluentd and ES, to ensure they're in a healthy, running state (see the pod-health sketch after this list).
  • Backups
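
On the certificate front, a short script can parse a PEM file and report the days remaining before expiry. This is a minimal sketch in Python assuming the cryptography package is installed; the certificate path is only an example and will differ per installation.

    # Minimal sketch: report how many days remain before a certificate expires.
    # Assumes the `cryptography` package; the path below is only an example.
    from datetime import datetime
    from cryptography import x509

    CERT_PATH = "/etc/origin/master/master.server.crt"  # example path; adjust for your cluster

    with open(CERT_PATH, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    days_left = (cert.not_valid_after - datetime.utcnow()).days
    print(f"{CERT_PATH} expires in {days_left} days")
    if days_left < 30:
        print("WARNING: certificate is within 30 days of expiry")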
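
On quotas and limits, a LimitRange in each tenant project guarantees default request sizing for containers that don't specify their own. The sketch below uses the kubernetes Python client; the namespace name and sizing values are purely illustrative.

    # Minimal sketch: apply default container requests/limits to a project via a LimitRange.
    # Assumes the `kubernetes` package and a working kubeconfig; namespace and values are illustrative.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

    limit_range = client.V1LimitRange(
        metadata=client.V1ObjectMeta(name="default-sizing"),
        spec=client.V1LimitRangeSpec(
            limits=[
                client.V1LimitRangeItem(
                    type="Container",
                    default={"cpu": "500m", "memory": "512Mi"},          # default limits
                    default_request={"cpu": "100m", "memory": "256Mi"},  # default requests
                )
            ]
        ),
    )

    client.CoreV1Api().create_namespaced_limit_range(namespace="my-tenant-project", body=limit_range)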
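
On etcd, each member's client port serves /health and /metrics endpoints that you can poll directly, even before Prometheus is in place. A minimal sketch with the requests package; the member URL and certificate paths are examples only, and exact metric names vary slightly between etcd versions.

    # Minimal sketch: poll an etcd member's health and metrics endpoints over its client port.
    # Assumes the `requests` package; the URL and certificate paths are examples only.
    import requests

    ETCD_URL = "https://etcd-0.example.com:2379"                             # example member URL
    CLIENT_CERT = ("/path/to/etcd-client.crt", "/path/to/etcd-client.key")   # example client cert/key
    CA_BUNDLE = "/path/to/etcd-ca.crt"                                       # example CA bundle

    health = requests.get(f"{ETCD_URL}/health", cert=CLIENT_CERT, verify=CA_BUNDLE, timeout=5)
    print("health:", health.json())

    metrics = requests.get(f"{ETCD_URL}/metrics", cert=CLIENT_CERT, verify=CA_BUNDLE, timeout=5)
    # Surface a few interesting series; metric names differ slightly across etcd versions.
    for line in metrics.text.splitlines():
        if "db_total_size" in line or line.startswith("etcd_server_has_leader"):
            print(line)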
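
On the logging stack, a quick scan for pods that aren't Running can catch a stalled fluentd or ES pod before logs are lost. The namespace below is an assumption, since the aggregated-logging project name varies by OpenShift version.

    # Minimal sketch: flag logging pods that are not in a Running phase.
    # Assumes the `kubernetes` package; the namespace is an example and varies by OpenShift version.
    from kubernetes import client, config

    config.load_kube_config()

    LOGGING_NAMESPACE = "openshift-logging"  # e.g. "logging" on older releases

    for pod in client.CoreV1Api().list_namespaced_pod(LOGGING_NAMESPACE).items:
        if pod.status.phase != "Running":
            print(f"{pod.metadata.name}: {pod.status.phase}")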

Node-level Pitfalls

  • Log Partition(s)

    • One of the most common issues I've personally seen on customer clusters is pods stalling in ContainerCreating after being scheduled successfully. A common, but not exclusive, cause is a full filesystem hosting /var/log. You may not receive a clear error indicating this is the problem, and the node will still report happy and "Ready," but the container(s) associated with the pod eventually fail to spawn. Monitor all filesystems associated with masters, nodes, etcd cluster members, and load balancers (if applicable); a simple disk-usage check follows this list.
  • Failed Scheduling

    • If a pod fails to schedule on a node, it's typically due to unsatisfiable resource requests (e.g. no available, ready node can fulfill a container's memory request), node overcommitment, or other out-of-resource conditions (e.g. memory or disk pressure). To proactively manage these conditions and ensure that near- or at-capacity nodes continue operating as expected, you can configure eviction policies that permit the node to begin reclaiming node-level resources. A sketch for spotting unschedulable pods follows this list.
  • Hostname Changes

    • Renaming a node without properly deleting and re-adding it to the cluster can cause node registration failures. Because node registration requires valid certificates, and those certificates require valid, resolvable hostnames, you'll need to delete the node from the cluster and re-execute the scale-up playbook to ensure the node re-registers. Another pitfall arises when renaming a node with cloud-provider integration (e.g. Azure, AWS, GCE, or VMware) enabled: instance/VM names must match OpenShift node names. Ensuring this consistency may require rebuilding the instance in most cases, or simply changing the VM name if using VMware.
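
To illustrate the log partition check above, the sketch below warns when the filesystem backing /var/log crosses a usage threshold. The 85% threshold and single path are arbitrary; extend the list to any mount points you care about.

    # Minimal sketch: warn when the filesystem hosting /var/log exceeds a usage threshold.
    # The 85% threshold and the single path are illustrative; add any mounts you care about.
    import shutil

    PATHS = ["/var/log"]
    THRESHOLD = 0.85

    for path in PATHS:
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        status = "WARNING" if used_fraction >= THRESHOLD else "ok"
        print(f"{path}: {used_fraction:.0%} used ({status})")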
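
For failed scheduling, the scheduler records an Unschedulable condition on each Pending pod with a message explaining what couldn't be satisfied. Here's a minimal sketch, assuming the kubernetes Python client and cluster-wide read access, to surface those messages.

    # Minimal sketch: list Pending pods and print the scheduler's explanation for each.
    # Assumes the `kubernetes` package and a kubeconfig with cluster-wide read access.
    from kubernetes import client, config

    config.load_kube_config()

    pods = client.CoreV1Api().list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    for pod in pods.items:
        for cond in pod.status.conditions or []:
            if cond.type == "PodScheduled" and cond.status == "False":
                print(f"{pod.metadata.namespace}/{pod.metadata.name}: {cond.reason}: {cond.message}")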

Questions? Leave a comment!
