Developer Playground
Understanding Kubernetes cgroup v2 & Deep Dive into JVM Pod Memory Issues
What is cgroup?
cgroups (control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network) of a collection of processes. Without cgroups, containerization technologies like Docker and Kubernetes would not exist. When you define resources.limits.memory in a Kubernetes Pod spec, the kubelet ultimately translates this request into cgroup configurations on the host operating system.
cgroup v1 vs v2: The Paradigm Shift
For years, cgroup v1 served as the backbone of container orchestration. However, its architecture was notoriously fragmented. In v1, different resources (CPU, Memory, Block I/O) had their own independent hierarchies. A process could belong to one group for CPU and a completely different group for Memory, making holistic resource management incredibly complex and prone to inconsistencies.
cgroup v2 solves this by introducing a Unified Hierarchy: a single tree in which all controllers (CPU, memory, I/O) are attached to the same groups, so a process belongs to exactly one cgroup for every resource.
Why Kubernetes Moved to cgroup v2
Kubernetes officially announced GA (General Availability) for cgroup v2 in version 1.25. The shift wasn't just a version bump; it unlocked significant architectural benefits:
- Memory QoS (Quality of Service): cgroup v1 only supported hard memory limits (kill the process if it goes over). cgroup v2 introduces `memory.high` and `memory.low`, allowing soft throttling and protection before invoking the lethal OOM killer.
- Safe OOM Handling: Improved kernel awareness means the OS can more intelligently reclaim memory from page caches before ruthlessly killing application containers.
- eBPF Integration: Advanced networking and observability tools based on eBPF (like Cilium) heavily rely on the unified structure of cgroup v2 to track packets down to the exact container process.
The JVM Memory Crisis in cgroup v2
While cgroup v2 is fantastic for the Linux ecosystem, it created a massive headache for Java engineers migrating their Spring Boot applications.
Historically, the JVM used a feature called UseContainerSupport (enabled by default since Java 10, and backported to 8u191). This flag tells the JVM: "Hey, you are running inside a container. Don't look at the physical host's memory to set your Heap size. Instead, look at the cgroup limits."
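You can see which ceiling your JVM actually picked by printing the computed maximum heap: `Runtime.getRuntime().maxMemory()` reflects whatever limit (container or host) the JVM detected at startup. A minimal probe (the class name `HeapProbe` is just for illustration):

```java
public class HeapProbe {
    public static void main(String[] args) {
        // maxMemory() returns the heap ceiling the JVM settled on at startup,
        // which is derived from the cgroup limit when container support works.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MiB): " + maxBytes / (1024 * 1024));
    }
}
```

Running this inside the Pod and comparing the number against your `resources.limits.memory` is the quickest way to tell whether container detection worked.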
The Path Discrepancy
The problem lies in how the JVM finds that limit.
- cgroup v1 path: `/sys/fs/cgroup/memory/memory.limit_in_bytes`
- cgroup v2 path: `/sys/fs/cgroup/memory.max`
Older versions of the JVM (like the initial releases of Java 11) were hardcoded to read the v1 path. When a Pod running such a JVM is scheduled on a Kubernetes node using cgroup v2 (such as Ubuntu 22.04 or Amazon Linux 2023), it cannot find `memory.limit_in_bytes`.
What happens when the JVM can't find the container limit? It falls back to reading the underlying Host Node's physical RAM.
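The lookup order can be sketched in plain Java: try the v2 file first, fall back to the v1 path, and if neither exists behave like the legacy JVM and give up on container awareness. This is an illustration of the logic, not the actual HotSpot code:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class CgroupLimit {
    public static void main(String[] args) throws Exception {
        // Paths as described in the article; which one exists depends on the host.
        Path v2 = Path.of("/sys/fs/cgroup/memory.max");
        Path v1 = Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        if (Files.exists(v2)) {
            System.out.println("cgroup v2 limit: " + Files.readString(v2).trim());
        } else if (Files.exists(v1)) {
            System.out.println("cgroup v1 limit: " + Files.readString(v1).trim());
        } else {
            // The failure mode the article describes: no limit file found,
            // so an old JVM sizes itself from the host's physical RAM instead.
            System.out.println("no cgroup limit found; JVM would fall back to host RAM");
        }
    }
}
```

A JVM that only knows the v1 path skips straight to the last branch on a cgroup v2 node, which is exactly the bug.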
If your node has 64GB of RAM, the default JVM behavior (`-XX:MaxRAMPercentage=25.0`) will set the heap to 16GB. However, the Kubernetes Pod is rigidly constrained to 2GB by cgroup v2. As soon as the application receives traffic and the JVM attempts to allocate memory beyond 2GB, the Linux kernel's OOM killer instantly terminates the container (OOMKilled, exit code 137).
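The arithmetic behind that failure is worth making explicit (numbers from the scenario above):

```java
public class HeapMath {
    public static void main(String[] args) {
        long hostRamGiB = 64;           // physical RAM the fallback reads
        double maxRamPercentage = 25.0; // default -XX:MaxRAMPercentage
        long podLimitGiB = 2;           // the actual cgroup v2 limit

        double heapGiB = hostRamGiB * maxRamPercentage / 100.0;
        System.out.println("Computed heap: " + heapGiB + " GiB");
        System.out.println("Exceeds pod limit: " + (heapGiB > podLimitGiB));
    }
}
```

The JVM believes it has a 16GiB budget while the kernel enforces 2GiB, so an OOM kill is a matter of when, not if.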
Solutions and Best Practices
If you are moving to a modern Kubernetes environment, you must ensure your JVM is cgroup v2 aware.
1. Upgrade your JVM Version (The Best Solution)

cgroup v2 support was officially introduced in Java 15 (via JDK-8230305) and, fortunately, backported to the LTS versions. You must be running at least:

- Java 8u372 or higher
- Java 11.0.16 or higher
- Java 17+ (supported natively)

2. Explicitly Set -Xmx (The Mitigation)

If you absolutely cannot upgrade your JDK immediately, hardcode the maximum heap size in your Docker entrypoint or JVM arguments so the JVM never has to read the host memory: `-Xmx1500m` (e.g., leaving ~500MB for non-heap native memory in a 2GB Pod).

3. Check your Node environment

You can verify whether your Kubernetes node is running cgroup v2 with this command, run inside a pod or on the node:

```shell
# Check the filesystem type of the cgroup mount
stat -fc %T /sys/fs/cgroup/
```

If the output is `cgroup2fs`, you are running v2. If it is `tmpfs`, you are likely still on v1.
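The sizing arithmetic behind option 2 can be captured in a tiny helper (hypothetical, not a standard API): subtract a native-memory headroom from the Pod limit and use the remainder as `-Xmx`.

```java
public class XmxSizer {
    // Hypothetical helper: derive an -Xmx flag from the Pod memory limit,
    // reserving headroom for metaspace, thread stacks, and other native memory.
    static String xmxFor(long podLimitMiB, long headroomMiB) {
        return "-Xmx" + (podLimitMiB - headroomMiB) + "m";
    }

    public static void main(String[] args) {
        // 2 GiB Pod with ~548 MiB of headroom matches the article's -Xmx1500m example.
        System.out.println(xmxFor(2048, 548));
    }
}
```

How much headroom you need depends on the workload; heavy use of direct buffers, native libraries, or many threads calls for more than the roughly 500MB used here.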
As cloud providers like AWS EKS, GCP GKE, and Azure AKS move their latest node images to OS versions that use cgroup v2 exclusively (such as Amazon Linux 2023), understanding this interaction is no longer optional for Java engineers. Ensure your base Docker images carry up-to-date JVM patches to prevent catastrophic production outages.