LockSupport.parkNanos() Under the Hood and the Curious Case of Parking

When a colleague of mine was running some experiments, he noticed `LockSupport.parkNanos()` would either return almost immediately or in roughly 50 microsecond steps. In other words: calling `LockSupport.parkNanos(10000)` would not return after 10 microseconds, but roughly after 50 μs; `LockSupport.parkNanos(55000)` would not return after 55 μs, but roughly after 100 μs, etc. The 50 μs step was present far too consistently to be a coincidence. I was curious what was causing it, and because I love exploring how stuff works under the hood, I decided to have a closer look.

Reproducer in Java

The first step was easy: write a reproducer. It turned out to be simple: I re-used code from my older and somewhat related experiment and just added a new runner:

@Override
public void run() {
    long startNanos = System.nanoTime();
    for (long i = 0; i < iterationCount; i++) {
        LockSupport.parkNanos(10_000);
    }
    long durationNanos = System.nanoTime() - startNanos;
    long durationMillis = NANOSECONDS.toMillis(durationNanos);
    System.out.println(iterationCount + " iterations in " + durationMillis + " ms.");

    long microsPerIteration = NANOSECONDS.toMicros(durationNanos) / iterationCount;
    System.out.println("This means each iteration took " + microsPerIteration + " microseconds");
}

This was sufficient to reproduce the results my colleague observed in his JMH-based benchmark. On my Linux box, each iteration took just over 60 μs. When I increased the parking time to 55,000 ns, each iteration took around 110 μs, and with a parking time of 110,000 ns each iteration took 150 μs. It also took 150 μs with the parking time set to 105,000 ns. Clearly, the 50 μs increment was there.

The first step of the investigation was easy: just follow the `LockSupport.parkNanos()` implementation in my JDK. We can see it delegates to `Unsafe.park()`:

public static void parkNanos(long nanos) {
    if (nanos > 0)
        U.park(false, nanos);
}

Unfortunately, the `Unsafe` class does not tell us much:

public native void park(boolean isAbsolute, long time);

It’s a native method, which means we have to dive deeper. It turns out the `park()` implementation the JVM uses when running on Linux relies on the POSIX Threads API to implement thread parking.
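
In other words, a timed park on Linux boils down to a timed wait on a pthreads condition variable. The following is a heavily simplified sketch of that idea, not HotSpot’s actual code (which also handles the permit, clock selection and spurious wake-ups); the standalone reproducer in the next section fleshes it out:

#include <pthread.h>
#include <time.h>

static pthread_mutex_t park_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  park_cond  = PTHREAD_COND_INITIALIZER;

// Sketch only: block the calling thread for roughly 'nanos' nanoseconds.
void park_nanos_sketch(long nanos) {
    // Default condition variables measure deadlines against CLOCK_REALTIME.
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += nanos / 1000000000L;
    deadline.tv_nsec += nanos % 1000000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&park_mutex);
    // Nothing ever signals park_cond here, so we return on timeout
    // (or on a spurious wake-up, which real code would loop over).
    pthread_cond_timedwait(&park_cond, &park_mutex, &deadline);
    pthread_mutex_unlock(&park_mutex);
}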

Reproducer in C

Knowing this, I tried to create a simple reproducer in C. The idea was to take the JVM and its magic out of the equation and get as close to the metal as possible. I wrote my last non-trivial C code more than 15 years ago, so this turned out to be quite challenging. I learned many things; for example, I didn’t know glibc shipped stub implementations of the POSIX Threads API that do nothing! I was wondering why my `pthread_cond_timedwait()` returned immediately, without waiting for the timeout, until another colleague of mine pointed out I had forgotten to link the pthreads library, so my program was using the empty stubs from glibc. Eventually I succeeded, and here is the C reproducer:

#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#include <stdlib.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

#define CLOCK_TYPE CLOCK_MONOTONIC

void printTime(struct timespec *ts) {
    printf("%lld.%.9ldn", (long long)ts->tv_sec, ts->tv_nsec);
}

void addNanos(struct timespec *base, struct timespec *deadline, int deltaNanos) {
    deadline->tv_sec = base->tv_sec;
    deadline->tv_nsec = base->tv_nsec + deltaNanos;
    if (deadline->tv_nsec >= 1000000000L) {   // handle nanosecond overflow
        deadline->tv_sec += 1;
        deadline->tv_nsec -= 1000000000L;
    }
}

bool isDeadlineReached(struct timespec* deadline) {
    struct timespec currentTime;
    clock_gettime(CLOCK_TYPE, &currentTime);

    // Compare seconds first, then nanoseconds within the same second.
    if (currentTime.tv_sec != deadline->tv_sec) {
        return currentTime.tv_sec > deadline->tv_sec;
    }
    return currentTime.tv_nsec >= deadline->tv_nsec;
}

int initializeCond(pthread_cond_t *cond) {
    pthread_condattr_t cond_attr;
    if (pthread_condattr_init (&cond_attr)) {
        return -1;
    }
    if (pthread_condattr_setclock (&cond_attr, CLOCK_TYPE)) {
        return -1;
    }
    if (pthread_cond_init (cond, &cond_attr)) {
        return -1;
    }
    return 0;
}

int main() {
    int pid = getpid();
    printf("PID: %in", pid);
    sleep(10);
    printf("Sleeping finishedn");

    int deltaNanos = 2000;
    pthread_cond_t cond;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    if (initializeCond(&cond)) {
        return -1;
    }

    struct timespec start, end, deadline;
    if (pthread_mutex_lock(&lock)) {
        return -1;
    }
    clock_gettime(CLOCK_TYPE, &start);
    addNanos(&start, &deadline, deltaNanos);

    int iteration = 0;
    while (!isDeadlineReached(&deadline)) {
        pthread_cond_timedwait(&cond, &lock, &deadline);
        iteration++;
    }

    clock_gettime(CLOCK_TYPE, &end);
    printf("Iteration count: %in", iteration);
    printTime(&start);
    printTime(&deadline);
    printTime(&end);

    long elapsedNanos = (end.tv_sec - start.tv_sec) * 1000000000L + (end.tv_nsec - start.tv_nsec);
    printf("Elapsed time: %ld\n", elapsedNanos);
    return 0;
}
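
For the record, the reproducer has to be linked against the pthreads library, otherwise you end up with the glibc stubs mentioned above. Something along the lines of `gcc -o sleepgranularity sleepgranularity.c -lpthread` does the trick (the source file name is my assumption; the binary name matches the one used below).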

The code does very little: it acquires a mutex and then waits on a condition with a timeout. Given there is no other thread to signal the condition, the thread should unblock only after the timeout expires. The POSIX API allows spurious wake-ups, but we can ignore them in this experiment. Long story short: even when the code uses a timeout of just 5,000 ns, it sleeps for almost 60 microseconds. The same behaviour I observed in Java! This is a strong indication that the JVM/JDK is not causing the 50 μs step and that the root cause lies deeper.

Going Deeper

While running my C program under strace, I noticed that `pthread_cond_timedwait()` is implemented on top of the `futex` syscall. Again, this is not very surprising:

$ strace -e futex ./sleepgranularity
PID: 26955
Sleeping finished
futex(0x7fff800b3ad8, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=4084}) = -1 ETIMEDOUT (Connection timed out)

It’s interesting that the `tv_nsec` parameter is less than my configured timeout, which happens to be 5,000 ns. I assume it’s because `pthread_cond_timedwait()` takes an absolute deadline while futexes use a relative timeout, and perhaps the POSIX Threads implementation also does a little bit of busy-spinning before handing over to the kernel. This is just speculation, as I didn’t check the source code.
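
If the first part of that guess is right, the conversion would look roughly like this (a sketch of the idea only; the helper name is mine and glibc’s real code is more involved):

#include <time.h>

// Hypothetical helper: turn an absolute CLOCK_MONOTONIC deadline into the
// relative timeout a FUTEX_WAIT-style call expects.
static struct timespec absolute_to_relative(const struct timespec *deadline) {
    struct timespec now, rel;
    clock_gettime(CLOCK_MONOTONIC, &now);

    rel.tv_sec = deadline->tv_sec - now.tv_sec;
    rel.tv_nsec = deadline->tv_nsec - now.tv_nsec;
    if (rel.tv_nsec < 0) {          // borrow from the seconds field
        rel.tv_sec -= 1;
        rel.tv_nsec += 1000000000L;
    }
    // A real implementation would also handle deadlines that are already in the past.
    return rel;
}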

OK, so now we know `LockSupport.parkNanos()` uses the `futex` syscall under the hood, and it’s very likely the Linux kernel itself causing this mysterious 50 μs step. So once again: We Have To Go Deeper!

Dive into the Kernel

The futex implementation in the kernel uses hrtimers for timeouts. Hrtimer stands for “high-resolution timer” and it is the in-kernel API for … well, timers. Timeout enforcement is a textbook use case for timers. You can read more about Linux timers in the kernel documentation. At this point I wondered: is there any way to configure hrtimer granularity? It turns out there is! Since version 4.6, Linux allows configuring a “timer slack” value via the `/proc/` filesystem. The slack value is used to group timers which are close to each other and minimize the number of wake-ups. And guess what? The default value happens to be 50 μs – the same as the step we observed!

We can start with the C experiment as it has fewer moving parts than the Java one. If you read the source code above carefully, you noticed it first prints its own PID and then sleeps 10 seconds before calling `pthread_cond_timedwait()`. The sleep is there for a reason: it gives us a chance to modify the slack value via the `/proc/` filesystem. It’s actually dead simple, let’s try: `# echo 200000 > /proc/<pid>/timerslack_ns` This increases the slack from the default 50,000 ns to 200,000 ns, and `pthread_cond_timedwait()` should return much later than the usual 50-60 μs. Let’s run it!

PID: 32644
Sleeping finished
Iteration count: 1
21090.742608800
21090.742615800
21090.742895275
Elapsed time: 286475

You can see it took well over 200 μs! Let’s try it the other way around – decreasing this value should make `pthread_cond_timedwait()` return faster.

`# echo 1000 > /proc/702/timerslack_ns`

$ ./sleepgranularity
PID: 702
Sleeping finished
Iteration count: 1
21311.855077826
21311.855079826
21311.855080608
Elapsed time: 2782

It returned in less than 3 μs.

Conclusion #1

We proved it’s possible to decrease the 50 μs granularity step by changing the `timerslack_ns` value!
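
As an aside, the slack does not have to be tweaked from the outside: a thread can also query and change its own value with `prctl()`, which predates the `/proc/` file. Here is a minimal sketch, assuming a Linux with `PR_GET_TIMERSLACK` / `PR_SET_TIMERSLACK` support (they have been around since 2.6.28):

#include <stdio.h>
#include <sys/prctl.h>

int main(void) {
    // PR_GET_TIMERSLACK returns the calling thread's current slack in nanoseconds.
    printf("current timer slack: %d ns\n", prctl(PR_GET_TIMERSLACK, 0UL, 0UL, 0UL, 0UL));

    // PR_SET_TIMERSLACK changes the slack of the calling thread only.
    if (prctl(PR_SET_TIMERSLACK, 1000UL, 0UL, 0UL, 0UL) != 0) {
        perror("prctl(PR_SET_TIMERSLACK)");
        return 1;
    }
    printf("new timer slack: %d ns\n", prctl(PR_GET_TIMERSLACK, 0UL, 0UL, 0UL, 0UL));
    return 0;
}

The slack is a per-thread attribute, which is presumably why the Java experiment below has to target the native thread ID of the `main` thread rather than the JVM’s PID.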

Back to Java

Let’s try the same with Java. It turns out to be a bit more complicated, mostly because Java starts multiple threads and I’ve always been confused by the relationship between threads and full-blown processes in Linux. Anyway, here is what worked for me:

1. Start the Java experiment as `java -jar ./target/testvdso-1.0-SNAPSHOT.jar PARK 100000 1000`
2. Find the JVM process with `jps`. It gives me this output: `1387 testvdso-1.0-SNAPSHOT.jar`
3. Find the native thread ID of the `main` Java thread with `jstack 1387 | grep main`. It gives me this output: `"main" #1 prio=5 os_prio=0 tid=0x00007fd6bc012000 nid=0x56c waiting on condition [0x00007fd6c5ada000]`
4. Now we have the native thread ID = `0x56c`. Let’s convert it from hex: `0x56c = 1388` decimal.
5. At this point we can use the `/proc/` filesystem: `# echo 1000 > /proc/1388/timerslack_ns`

**Behold!**

100000 iterations in 5563 ms. This means each iteration took 55 microseconds
100000 iterations in 5530 ms. This means each iteration took 55 microseconds
100000 iterations in 5537 ms. This means each iteration took 55 microseconds
100000 iterations in 5537 ms. This means each iteration took 55 microseconds
100000 iterations in 5537 ms. This means each iteration took 55 microseconds
100000 iterations in 3737 ms. This means each iteration took 37 microseconds
100000 iterations in 525 ms. This means each iteration took 5 microseconds
100000 iterations in 527 ms. This means each iteration took 5 microseconds
100000 iterations in 521 ms. This means each iteration took 5 microseconds

Can you spot the line with the change? The latency went down from 55 microseconds to just 5 micros! Yay!

Conclusion #2

We can shrink the minimum usable thread parking time in Java with a bit of OS tweaking!

Caveats

The research above makes many assumptions: about the JDK’s `parkNanos()` implementation, the POSIX Threads implementation, futexes, hrtimers, etc. Obviously, I assume Linux; things are quite different on other operating systems.

Summary

Even when using a high-level language such as Java, it’s good to understand what is going on at a lower level. Linux offers a great amount of flexibility and many tunables, and it’s possible to use them to reduce the minimum parking time.

Would I use this in production? Not really, or at least definitely not without further investigation of all the implications. The first thing that comes to mind is increased CPU load when the step is too small. There might be other side effects I am not aware of. However, it was tremendous fun to explore, and I learned a lot about quite a few layers of my stack!

If you like poking at systems and learning about low-level details, you can follow me on Twitter!