Developer Playground by Giri

About JVM Warm-up

Updated: March 31, 2025
[Diagram: JVM architecture. Java Virtual Machine → Class Loader Subsystem (Bootstrap, Extension, and Application class loaders); Runtime Data Area (Method Area with runtime constant pool and field/method data; Heap with young generation (Eden, S0, S1), old generation, and objects and arrays; per-thread Java Stacks with frame data, local variables, and operand stacks; per-thread PC Registers holding the current instruction address; Native Method Stacks with native method info, parameters, and return values); Execution Engine (Interpreter, JIT compiler with C1 (client) and C2 (server), Garbage Collector); Java Native Interface (JNI) bridging to native method libraries (C/C++ libraries, native OS libraries, other native code). Flow: 1 load classes, 2 start execution, 3 access data, 4 native method calls via JNI, 5 access native libraries; the GC manages heap memory directly without going through JNI. The JVM is platform independent, but each JVM implementation is platform specific; the native method interface and libraries interact with the host OS.]

Class Loading

When a JVM process starts, the class loader brings required classes into memory through three stages. Loading is lazy: a class is loaded the first time it is referenced, not all at once at startup.

  • Bootstrap Class Loading: The Bootstrap Class Loader loads the core Java classes, such as java.lang.Object, from JRE/lib/rt.jar (prior to Java 9, when rt.jar was replaced by the module system).
  • Extension Class Loading: The ExtClassLoader loads all JAR files on the java.ext.dirs path. These are JAR files added manually by developers, not dependencies managed by Gradle or Maven.
  • Application Class Loading: The AppClassLoader loads all classes on the application classpath.
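The delegation hierarchy can be observed from any running program. A minimal sketch (the exact loader names printed vary by JDK version; on Java 9+ the extension loader is replaced by the platform class loader):

```java
public class ClassLoaderDemo {
    public static void main(String[] args) {
        // Core classes are loaded by the bootstrap loader, which is
        // implemented natively and therefore reported as null.
        System.out.println(String.class.getClassLoader());

        // Application classes are loaded by the application class loader.
        ClassLoader appLoader = ClassLoaderDemo.class.getClassLoader();
        System.out.println(appLoader);

        // Walking up the parent chain shows the delegation hierarchy.
        System.out.println(appLoader.getParent());
    }
}
```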

Execution Engine

Java uses a hybrid of interpretation and compilation. javac compiles source code into platform-independent bytecode. At runtime, the JVM first executes this bytecode through interpretation. The Just-In-Time (JIT) compiler then compiles frequently executed sections, known as hotspots (typically entire methods), into native code; subsequent executions use the compiled native code directly.
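The split is visible with the standard toolchain: javac produces the bytecode and javap disassembles it, while the JVM interprets it until the JIT decides it is hot. A minimal example (file and class names are illustrative):

```java
// Compile and inspect the platform-independent bytecode:
//   javac Adder.java
//   javap -c Adder     # shows instructions such as iload_0, iadd, ireturn
public class Adder {
    static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        // The JVM interprets this bytecode first; if add() became a
        // hotspot, the JIT would compile it to native code.
        System.out.println(add(2, 3));
    }
}
```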

C1 - Client Compiler

Focuses on quick startup and good response times. Its optimization level is lower than C2's, but its shorter compilation times provide faster initial execution and a better user experience. It applies simpler optimizations so code can be compiled quickly and the application reaches usable speed sooner.

C2 - Server Compiler

Focuses on peak performance rather than compilation speed. Through a longer compilation process than C1, it performs deeper optimizations to maximize execution speed. Compared to C1, it applies advanced optimization techniques, analyzing runtime execution patterns to make hot code more efficient and improve long-term performance.

Tiered Compilation

Active by default from Java 8, and it's recommended to use the default settings.

The C2 compiler typically uses more memory and time compared to C1, but provides more optimized native code. Tiered compilation was first introduced in Java 7. The goal is to use both C1 and C2 to achieve both fast startup time and long-term performance improvement.

The interpreter collects profiling information for methods and passes it to the compiler; C1 then generates a compiled version using this information. Over the application's lifetime, frequently used methods are compiled and stored in the code cache.
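The relevant switches are worth knowing even though the defaults are recommended. A config sketch (MyApp is a placeholder for your main class):

```shell
# Tiered compilation is on by default since Java 8:
java -XX:+TieredCompilation MyApp

# C1 only: fastest startup, lower peak throughput (can suit short-lived tools)
java -XX:TieredStopAtLevel=1 MyApp

# Disable tiering entirely (interpreter + C2 only, the old server-mode behavior)
java -XX:-TieredCompilation MyApp
```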

[Chart: performance over time from startup — code runs interpreted and profiled, then C1-compiled and profiled, then C2-compiled and non-profiled, with a compilation step between each phase.]

When an application starts, the JVM initially interprets all bytecode and profiles it. The JIT compiler uses this collected profiling data to identify hotspots.

First, the JIT compiler quickly compiles frequently executed code sections to native code using C1. Later, C2 uses profiling information generated by the interpreter and C1 to further optimize the native code that C1 compiled. This process takes longer than the time required by C1.

Code Cache

This is a memory area where the JVM stores bytecode that has been compiled into native code. Because tiered compilation compiles many methods with both C1 and C2, it increases the amount of code in the code cache roughly fourfold, and the JVM raises the default code cache size accordingly (-XX:ReservedCodeCacheSize).

Since Java 9, the code cache has been divided into three segments to improve locality and reduce memory fragmentation:

  • Non-method segment - JVM-internal code (about 5 MB, adjustable via -XX:NonNMethodCodeHeapSize)
  • Profiled-code segment - code compiled by C1, which may have a short lifespan (default ~122 MB, adjustable via -XX:ProfiledCodeHeapSize)
  • Non-profiled segment - code compiled by C2, which may have a longer lifespan (default ~122 MB, adjustable via -XX:NonProfiledCodeHeapSize)
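The segments can be inspected at runtime through the standard memory-pool MXBeans. A minimal sketch (pool names depend on the JDK version: Java 9+ with the segmented cache exposes three "CodeHeap" pools, while Java 8 has a single "Code Cache" pool):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class CodeCachePools {
    public static void main(String[] args) {
        // With the segmented code cache the three segments appear as
        // "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'",
        // and "CodeHeap 'non-profiled nmethods'".
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Code")) {
                System.out.printf("%s: used=%d KB%n",
                        pool.getName(), pool.getUsage().getUsed() / 1024);
            }
        }
    }
}
```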

Deoptimization

Code compiled by C2 can turn out to be wrongly optimized, for example when the profiling information no longer matches the method's actual behavior. In such cases, the JVM discards the compiled code and temporarily falls back to interpretation.
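One common trigger is speculation at a call site. A hedged sketch (all names are illustrative): after many monomorphic calls, C2 may compile total() speculating that only Circle ever reaches it; the first Square then invalidates that speculation (visible as "made not entrant" with -XX:+PrintCompilation) and execution falls back to the interpreter until recompilation.

```java
interface Shape { double area(); }

class Circle implements Shape { public double area() { return Math.PI; } }

class Square implements Shape { public double area() { return 1.0; } }

public class DeoptDemo {
    static double total(Shape s) { return s.area(); }

    public static void main(String[] args) {
        double sum = 0;
        Shape circle = new Circle();
        for (int i = 0; i < 100_000; i++) {
            sum += total(circle);       // profile: receiver is always Circle
        }
        sum += total(new Square());     // speculation fails -> deoptimization
        System.out.println(sum);
    }
}
```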

Compilation Levels

The interpreter and the two JIT compilers together form five compilation levels:

Level 0 - Interpreted Code

At this stage, the JVM interprets all Java bytecode, reading and executing it instruction by instruction, which yields lower performance than compiled native code.

Level 1 - Simple C1 Compiled Code

The JVM compiles methods deemed non-critical using C1 without collecting profiling information. This typically applies to very simple or low-complexity methods. These methods aren't expected to show significant performance improvements even with further optimization by C2. The main purpose is to speed up execution, allowing code to run with minimal overhead. Since profiling information isn't collected, the JVM doesn't decide on additional optimization for code running at this level. This reduces system resource usage and ensures fast execution for simple methods.

Level 2 - Limited C1 Compiled Code

C1 analyzes code through lightweight profiling. The JVM uses this stage when the C2 Queue is full. Since C2 performs extensive optimizations requiring significant time and resources, it temporarily uses C1 with lightweight profiling to improve performance without waiting.

Level 3 - Full C1 Compiled Code

After running code compiled at level 2 for some time, the JVM collects more runtime data and compiles it with full profiling through C1 at this stage. This includes more comprehensive data collection than lightweight profiling, allowing identification of complex patterns and optimization opportunities. It collects detailed execution metrics for more complex optimizations that C2 will perform.

Level 4 - C2 Compiled Code

When the C2 Queue is available and important hotspots are identified based on full profiling from level 3, this stage proceeds. C2 applies optimization techniques to generate native code. This is the final stage and aims to maximize execution efficiency based on insights gained from extensive profiling data.

[Diagram: the interpreter runs and profiles code → C1 quickly generates code and continues profiling → C2 generates highly optimized code; both compilers save compiled code to the code cache; deoptimization discards compiled code and returns to interpretation and profiling.]

The JVM continues with interpretation until reaching the Tier3CompileThreshold. After that, C1 compiles the method and continues profiling. Finally, C2 compiles when reaching the Tier4CompileThreshold. The JVM may decide to deoptimize C2-compiled code, in which case the process starts again from the beginning.
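The progression through the tiers can be watched directly. A sketch (thresholds and output format vary by JDK; class and method names are illustrative):

```java
public class TierDemo {
    // A small hot method: recursive Fibonacci keeps the interpreter's
    // invocation and back-edge counters climbing quickly.
    static long fib(int n) {
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        long acc = 0;
        for (int i = 0; i < 50_000; i++) {
            acc += fib(10);
        }
        System.out.println(acc);
        // Run with: java -XX:+PrintCompilation TierDemo
        // The tier column shows fib() moving through the levels, e.g.
        // "3  TierDemo::fib" (full C1 + profiling) and later
        // "4  TierDemo::fib" (C2).
    }
}
```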

JVM Warming Up

After class loading completes, the classes used during process startup are cached by the JVM so they run faster at runtime. Other classes are loaded and cached on demand, the first time a request needs them.

Due to lazy class loading and Just In Time compilation, the first request in a Java web application has a slower average response time.

To improve the slow response in the first request, all classes need to be pre-loaded into the JVM cache. This process is called JVM warming up.

Manual Implementation

This involves writing custom warm-up code that directly exercises the classes used at application startup. For web applications, you can make the application send API requests to itself. In Spring Boot applications, you can use CommandLineRunner or ApplicationRunner to make internal calls during the Spring lifecycle:

  1. ApplicationStartingEvent
  2. ApplicationEnvironmentPreparedEvent
  3. ApplicationContextInitializedEvent
  4. ApplicationPreparedEvent
  5. ApplicationStartedEvent
  6. AvailabilityChangeEvent(LivenessState.CORRECT)
  7. ApplicationRunner, CommandLineRunner execution
    • You can preload classes used in the application through internal calls to load them into the native cache.
  8. ApplicationReadyEvent(ReadinessState.ACCEPTING_TRAFFIC)
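Stripped of the framework, the idea is simply to drive the hot path across the compile thresholds before real traffic arrives. A minimal plain-Java sketch (parseAndSum is a hypothetical hot path standing in for your request-handling code):

```java
public class WarmUp {
    // Hypothetical hot path we want compiled before real traffic arrives.
    static int parseAndSum(String csv) {
        int sum = 0;
        for (String part : csv.split(",")) {
            sum += Integer.parseInt(part.trim());
        }
        return sum;
    }

    public static void main(String[] args) {
        // Exercise the hot path repeatedly so the interpreter's invocation
        // counters cross the C1/C2 thresholds during startup rather than
        // during the first user request.
        for (int i = 0; i < 20_000; i++) {
            parseAndSum("1, 2, 3, 4");
        }
        System.out.println(parseAndSum("10, 20, 30"));
    }
}
```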
Copyright © 2025 Giri Labs Inc.