Diagnosing and Resolving Memory Spikes in Nginx on EKS AL2023
If you’ve migrated from Amazon Linux 2 (AL2) to AL2023 on EKS and noticed memory spikes in Nginx pods, you’re not alone. This post walks through a pragmatic diagnosis and repair workflow, with actionable steps to prevent recurrence.
Diagnosis: What’s Changed?
AL2023 introduces updates to containerd, kernel versions, and cgroup management. Memory spikes in Nginx often stem from:
- cgroup v1 vs v2 reporting differences: AL2023 defaults to cgroup v2, which accounts for memory differently than v1 (working set and page cache are counted differently), so kubelet and monitoring tools can report higher usage for an Nginx pod even though the process (and its underlying glibc allocations) has not grown, and Kubernetes may OOMKill pods despite apparently available memory. A quick check of a node's cgroup version follows this list.
- Resource limits misconfiguration: if memory limits aren't aligned with Nginx's actual usage patterns (e.g., during SSL termination or high request volume), Kubernetes may OOMKill or evict pods prematurely.
- Kernel or containerd bugs: while AL2023 includes fixes (e.g., containerd 1.5+), older versions of Nginx or misconfigured node agents (e.g., kubelet) can exacerbate issues.
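To run that check, look at the filesystem type mounted at /sys/fs/cgroup on the node itself (e.g., over SSM Session Manager or SSH); a quick sketch, not specific to Nginx:

# cgroup2fs means cgroup v2 (the AL2023 default); tmpfs means cgroup v1
stat -fc %T /sys/fs/cgroup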
Verify the Issue
Run these commands to triage:
# Check node/pod memory usage
kubectl top nodes
kubectl top pods -l app=nginx
# Inspect pod events for OOMKilled
kubectl describe pod <nginx-pod-name>
# Check the OS and containerd version on the node (containerd should be ≥1.5)
cat /etc/os-release && containerd --version
If pods are OOMKilled despite low actual memory usage, suspect cgroup v2 reporting or misconfigured limits.
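To confirm the kill reason programmatically (handy when checking many pods), pull the last termination state; a small sketch assuming the Nginx container is the first container in the pod:

# Expect "OOMKilled" if the container was killed by the memory cgroup
kubectl get pod <nginx-pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'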
Repair Steps
1. Adjust Memory Requests/Limits
Temporarily increase memory limits to mitigate OOMKills while diagnosing:
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
Monitor usage with kubectl top pods and adjust based on observed peaks.
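For context, here is how that block fits into a Deployment spec; a minimal sketch assuming a Deployment named nginx with the app=nginx label used in the commands above (name, image tag, and replica count are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.21
          resources:
            requests:
              memory: "512Mi"   # baseline used for scheduling
            limits:
              memory: "1Gi"     # OOMKill threshold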
2. Force cgroup v1 (If Necessary)
If cgroup v2 is suspected, force v1 on nodes:
- Add the kernel parameter that switches systemd back to the cgroup v1 (legacy) hierarchy. On AL2023, use grubby rather than editing grub.cfg by hand:
  grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
  For managed node groups, bake this into launch template user data so replacement nodes keep the setting.
- Reboot nodes and verify:
  mount | grep cgroup   # Should show "cgroup on /sys/fs/cgroup/memory type cgroup"
Tradeoff: cgroup v1 is deprecated; use this only as a temporary workaround.
3. Update Nginx and Dependencies
Ensure Nginx is updated to a version ≥1.21.0 (better cgroup v2 compatibility). Example:
# In Dockerfile
FROM nginx:1.21
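If you bump the image tag on an existing Deployment instead of rebuilding, you can roll it out and confirm what the pods are running; a sketch assuming a Deployment and container both named nginx:

# Update the image and wait for the rollout to complete
kubectl set image deployment/nginx nginx=nginx:1.21
kubectl rollout status deployment/nginx
# Confirm the image version the pods are actually running
kubectl get pods -l app=nginx -o jsonpath='{.items[*].spec.containers[0].image}'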
4. Validate Kernel and containerd
Ensure nodes use kernel ≥5.10 and containerd ≥1.5. Update EKS node groups to the latest AL2023 AMI.
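Kernel and runtime versions are visible straight from the Kubernetes API, and a managed node group can be rolled to the latest AMI release with the AWS CLI; the cluster and node group names below are placeholders:

# KERNEL-VERSION and CONTAINER-RUNTIME columns should show ≥5.10 and containerd ≥1.5
kubectl get nodes -o wide
# Roll a managed node group to the latest AL2023 AMI release
aws eks update-nodegroup-version --cluster-name <cluster-name> --nodegroup-name <nodegroup-name>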
Prevention
Policy Example: Resource Quotas
Enforce memory limits at the namespace level to prevent runaway usage:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nginx-memory-quota
spec:
  hard:
    requests.memory: "10Gi"
    limits.memory: "10Gi"
    pods: "10"
Apply to production namespaces to cap total memory consumption.
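Applying and checking the quota is straightforward; this assumes the namespace is called production and the manifest is saved as nginx-memory-quota.yaml:

kubectl apply -f nginx-memory-quota.yaml -n production
# Show current usage against the quota
kubectl describe resourcequota nginx-memory-quota -n production

Note that once a quota covers limits.memory, every new pod in the namespace must declare a memory limit or it will be rejected, so pair the quota with a LimitRange or explicit limits in your manifests.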
Monitoring Workflow
- Alert on memory usage: use Prometheus + Alertmanager to trigger alerts when Nginx memory exceeds 80% of its limit (example rule after this list).
- Log node OOM events:
  journalctl -k | grep -i "out of memory"   # kernel OOM-killer messages
- Regularly review metrics:
  kubectl get hpa,quota,limits -A
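For the alerting bullet above, a sketch of a Prometheus alerting rule; it assumes cAdvisor metrics (container_memory_working_set_bytes) and kube-state-metrics (kube_pod_container_resource_limits) are being scraped, and the group and alert names are arbitrary:

groups:
  - name: nginx-memory
    rules:
      - alert: NginxMemoryNearLimit
        expr: |
          sum(container_memory_working_set_bytes{container="nginx"}) by (namespace, pod)
            /
          sum(kube_pod_container_resource_limits{container="nginx", resource="memory"}) by (namespace, pod)
            > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Nginx pod {{ $labels.pod }} is above 80% of its memory limit"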
Tooling
- kubectl: real-time pod/node metrics (kubectl top).
- Prometheus/Grafana: long-term monitoring of memory trends.
- Node Problem Detector: Logs node-level issues (e.g., OOM events).
- AWS CloudWatch: Track node memory usage outside Kubernetes.
Conclusion
Memory spikes in Nginx on AL2023 are often due to cgroup v2 reporting quirks or misconfigured limits—not true leaks. Adjust resource limits, validate dependencies, and enforce quotas to stabilize workloads. Prioritize monitoring and updates to prevent recurrence. If spikes persist, test cgroup v1 as a fallback while working with AWS/Kubernetes upstream teams for long-term fixes.
Source thread: EKS AL2 to AL2023 memory usage spikes in nginx, anyone else?
