How long do short sleeps actually take?

There comes a time in the life of almost any C++ programmer when one of the various sleep functions rears its head. Most of the time the problem boils down to some kind of polling algorithm, for example waiting for a resource while letting other processes work in the meantime [1].

While sleeping is not very accurate in general, predicting what happens with a sleep of a hundred milliseconds or more is usually fairly simple. This post will concern itself with the extreme low values, primarily zero and the lowest non-zero value the specific sleep function will accept.

Intuitively, a sleep of zero time means that the currently running thread of execution allows the scheduler the chance to schedule some other thread that actually may have better work to do - like release the resource it is waiting for. This means that, for a system with low load, this sleep should usually take about the time of a context switch.
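The standard library's explicit way to express this intent is ::std::this_thread::yield. The following is a minimal sketch of such a polling loop, not code from the benchmark; the resource_ready flag and the wait_for_resource function are hypothetical names chosen for illustration:

```cpp
#include <atomic>
#include <thread>

// Hypothetical flag; in a real program another thread would set it
// once the resource has been released.
::std::atomic<bool> resource_ready{false};

// Polls the flag and offers the scheduler a chance to run another
// thread after each unsuccessful check. Returns the number of yields.
unsigned wait_for_resource() {
    unsigned yields = 0;
    while(!resource_ready.load(::std::memory_order_acquire)) {
        ::std::this_thread::yield(); // only a hint: may return immediately
        ++yields;
    }
    return yields;
}
```

On a lightly loaded system each pass through the loop should cost roughly one trip through the scheduler, which is the expectation the measurements in this post put to the test.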

When choosing the smallest non-zero time, we can argue that the result should not be much different, but if both versions adhered to expectations, this article would be pretty darn useless...

Setup

The Windows and Linux experiments were conducted on a dual Intel Xeon X5680 system providing a whole bunch of cores. The OS X experiments were conducted on a 2.8 GHz Intel Core i7 "Macbook Pro (Retina, 13-inch, Late 2013)" providing 4 logical cores.

Everything was compiled for x64 and configured to represent a typical release build. The total system CPU load was usually in the range of 2-5%. All experiments were repeated at least 20 times in an interleaved fashion and 99.9% confidence intervals are given for each one. Where not otherwise noted, results are normalized to one execution of the sleep function.
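To make the "value ± interval" notation concrete, here is a rough sketch of how such a number could be derived from repeated runs. It assumes a normal approximation (z = 3.2905 for a two-sided 99.9% interval); the post does not state which method the analysis script uses, and Student's t would give somewhat wider intervals for 20 samples. The names Interval and confidence_interval are made up for this example:

```cpp
#include <cmath>
#include <vector>

struct Interval {
    double mean;       // nanoseconds per sleep call
    double half_width; // the "±" part
};

// `totals_ns` holds one total runtime per repetition of the benchmark,
// `calls` is the number of sleep calls per repetition (3000 in this post).
// Needs at least two repetitions.
// ASSUMPTION: normal approximation, z = 3.2905 for 99.9% two-sided.
Interval confidence_interval(const ::std::vector<double>& totals_ns, unsigned calls) {
    const double n = static_cast<double>(totals_ns.size());

    // Normalize every repetition to a single call and average.
    double sum = 0.0;
    for(double t : totals_ns) sum += t / calls;
    const double mean = sum / n;

    // Sample standard deviation of the normalized values.
    double squares = 0.0;
    for(double t : totals_ns) {
        const double d = t / calls - mean;
        squares += d * d;
    }
    const double stddev = ::std::sqrt(squares / (n - 1.0));

    return {mean, 3.2905 * stddev / ::std::sqrt(n)};
}
```

Feeding it the 20 or more totals printed by one benchmark variant would yield one "mean ± half_width" entry of the lists below.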

All operating systems were "lived in", without any intentional changes to system clock resolution or similar mechanisms. Hopefully this represents the typical use case better than a virgin system fresh out of the box. Similarly, the systems were not put into a benchmark mode where as many programs as possible are disabled. For example, they continuously played music and had a browser open alongside the editor in which I was writing this article.

The test program used to give the ground truth is:

#define SLEEP(x) static_cast<void>((x))  
#include <chrono>
#include <iostream>

int main() {  
    unsigned t = 0;
    auto start = ::std::chrono::steady_clock::now();
    for(unsigned i = 0; i < 3000; ++i) {
        SLEEP(1);
        t ^= i; // prevent overeager optimization
    }
    auto stop = ::std::chrono::steady_clock::now();
    auto elapsed = ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
    ::std::cout << elapsed.count() << "\n";
    ::std::cerr << t << "\n";
}

The modifications for the individual sleep functions simply added any required headers and replaced the definition of the SLEEP macro with a version that invokes the appropriate sleep function instead. For example, the version relying on the C++11 sleep facilities is:

#include <thread>  
#define SLEEP(x) ::std::this_thread::sleep_for(::std::chrono::nanoseconds((x)))
#include <chrono>
#include <iostream>

int main() {  
    unsigned t = 0;
    auto start = ::std::chrono::steady_clock::now();
    for(unsigned i = 0; i < 3000; ++i) {
        SLEEP(1);
        t ^= i; // prevent overeager optimization
    }
    auto stop = ::std::chrono::steady_clock::now();
    auto elapsed = ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start);
    ::std::cout << elapsed.count() << "\n";
    ::std::cerr << t << "\n";
}

Be aware that the standard mandates that ::std::this_thread::sleep_for may block execution longer than intended, but not shorter. The standard also suggests that this function use a steady clock, which is the reason why the benchmark code does not use a high-resolution clock.
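The "longer, but not shorter" guarantee can be checked for a single call with a small helper. This is only a sketch; the name overshoot is invented here and is not part of the benchmark:

```cpp
#include <chrono>
#include <thread>

// Returns how much longer than `requested` a single sleep_for call took,
// as seen by the steady clock. The result is expected to be non-negative.
::std::chrono::nanoseconds overshoot(::std::chrono::nanoseconds requested) {
    auto start = ::std::chrono::steady_clock::now();
    ::std::this_thread::sleep_for(requested);
    auto stop = ::std::chrono::steady_clock::now();
    return ::std::chrono::duration_cast<::std::chrono::nanoseconds>(stop - start) - requested;
}
```

Calling overshoot(::std::chrono::milliseconds(1)) gives the extra time for one 1 ms sleep; the benchmark itself averages over 3000 calls instead of timing each one.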

Windows 10

All code for Windows 10 was compiled by Visual Studio 2015, with Visual C++ 19.00.23026.

For this OS, we will use two platform-specific sleep functions in addition to ::std::this_thread::sleep_for: Sleep and SleepEx (with its second parameter set to FALSE). Both functions are documented to behave essentially the same in this test: when given 0, they yield execution without sleeping, and when given 1, they take any time up to one system clock tick.

Since WINAPI functions only take arguments with millisecond resolution, ::std::this_thread::sleep_for will be performed in two variations: Once with a nanosecond argument and once with a millisecond argument.
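As a sketch of what the four variants look like in code (the wrapper names are hypothetical, and the native calls are guarded so that the snippet also compiles on other platforms):

```cpp
#include <chrono>
#include <thread>
#ifdef _WIN32
#include <windows.h>
#endif

// The two variants of the portable call: same function, different units.
inline void sleep_for_ns(long long ns) {
    ::std::this_thread::sleep_for(::std::chrono::nanoseconds(ns));
}
inline void sleep_for_ms(long long ms) {
    ::std::this_thread::sleep_for(::std::chrono::milliseconds(ms));
}

#ifdef _WIN32
// The native variants; both take whole milliseconds as a DWORD.
inline void sleep_winapi(DWORD ms) { ::Sleep(ms); }
inline void sleep_winapi_ex(DWORD ms) { ::SleepEx(ms, FALSE); } // FALSE: not alertable
#endif
```

In the benchmark, each of these would simply become one definition of the SLEEP macro.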

The target system had a system clock resolution of:

ClockRes v2.0 - View the system clock resolution
Copyright (C) 2009 Mark Russinovich
SysInternals - www.sysinternals.com

Maximum timer interval: 15.625 ms
Minimum timer interval: 0.500 ms
Current timer interval: 1.001 ms

With a ground truth of less than one nanosecond per iteration (980 ± 61 nanoseconds per 3 000), we will first look at the cases where the sleep functions were explicitly asked to perform a zero duration sleep:

  • ::std::this_thread::sleep_for with 0 nanoseconds: 130 ± 1 ns
  • ::std::this_thread::sleep_for with 0 milliseconds: 132 ± 5 ns
  • Sleep: 64 ± 1 ns
  • SleepEx: 69 ± 11 ns

As expected, there is a certain cost for yielding execution, clocking in at less than 150 ns per sleep. It should also not come as a big surprise that the C++ standard library function has a higher overhead than the direct WINAPI calls.

Now the results for the minimal non-zero argument:

  • ::std::this_thread::sleep_for with 1 nanosecond: 1 535 253 ± 19 313 ns
  • ::std::this_thread::sleep_for with 1 millisecond: 2 000 969 ± 194 ns
  • Sleep: 2 000 949 ± 135 ns
  • SleepEx: 2 000 911 ± 134 ns

All functions targeting a single millisecond yield the same result, hitting 2 milliseconds instead of one.

I was surprised by the result of ::std::this_thread::sleep_for when given a 1 nanosecond argument, as it only takes about ¾ of the time that either native solution requires for its smallest argument. It should be noted, however, that both the relative and the absolute error are larger [2].
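The error comparison is plain arithmetic on the measured means. A small sketch, with Error and sleep_error being names made up for this illustration:

```cpp
struct Error {
    double absolute_ns; // measured time minus requested time
    double relative;    // absolute error divided by the requested time
};

Error sleep_error(double requested_ns, double measured_ns) {
    const double absolute = measured_ns - requested_ns;
    return {absolute, absolute / requested_ns};
}
```

With the numbers above, sleep_error(1000000, 2000969) gives an absolute error of about 1.0 ms and a relative error of about 1, while sleep_error(1, 1535253) gives about 1.5 ms and a relative error of about 1.5 million.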

Concluding: Out of these alternatives, ::std::this_thread::sleep_for performs best in general, as its interface alleviates much of the pain associated with the older APIs. Still, Sleep/SleepEx offer better performance when only yielding execution.

Linux

The operating system used was an Arch Linux identifying its kernel release as 4.1.6-1-ARCH. All code was compiled using g++ version 5.2.0.

For this operating system, we will discuss three different native methods in addition to ::std::this_thread::sleep_for. The obvious choice is nanosleep [3]; additionally, we will use the timeout of pselect [4] and the timerfd facility. The timerfd functionality was tested in three distinct configurations: recreating the timerfd on every call, reusing one timerfd but letting it fire only once, and finally preparing the timerfd with an interval timer in advance. As all these timer APIs have nanosecond resolution, the chosen inputs will be 0 and 1 nanoseconds. Additionally, sched_yield is evaluated as a 0 ns sleep.
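For readers unfamiliar with these calls, the following sketch shows how the three non-timerfd methods can be invoked. The wrapper names are hypothetical; the benchmark itself hides the calls behind the SLEEP macro:

```cpp
#include <sched.h>      // sched_yield
#include <sys/select.h> // pselect
#include <time.h>       // nanosleep, timespec

// Sleep via nanosleep; the remaining-time output is not needed here.
inline int sleep_nanosleep(long ns) {
    timespec ts{};
    ts.tv_nsec = ns; // must stay below 1 000 000 000
    return ::nanosleep(&ts, nullptr);
}

// Sleep via the timeout of pselect: no file descriptors, no signal mask.
inline int sleep_pselect(long ns) {
    timespec ts{};
    ts.tv_nsec = ns;
    return ::pselect(0, nullptr, nullptr, nullptr, &ts, nullptr);
}

// Pure yield, evaluated as a 0 ns sleep.
inline int sleep_yield() {
    return ::sched_yield();
}
```

All three return 0 on success; pselect returns 0 here because the timeout expires without any descriptor becoming ready.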

This operating system exhibits a ground truth of less than one nanosecond per iteration (604 ± 44 nanoseconds per 3 000) [5].

For the first set of benchmarks, in which the effect with a zero argument is evaluated, the timerfd family of timers will not be present, as their API makes this usage impossible [6]:

  • ::std::this_thread::sleep_for: 0 ± 1 ns per iteration (612 ± 35 ns per 3000)
  • nanosleep: 498 577 ± 427 ns
  • pselect: 136 ± 7 ns
  • sched_yield: 164 ± 8 ns

Right off the bat: ::std::this_thread::sleep_for does not require statistically significantly more time than the ground truth, and definitely not enough for a system call. It would seem that this case is handled completely in user space, without actually yielding execution at all.

Interestingly, pselect performs slightly better than sched_yield, which may be due to better optimized code, dumb luck, or because it does not actually yield execution. After all, it is not primarily intended to yield execution, but to wait upon an event [7].

Finally, nanosleep performs significantly worse than sched_yield, probably making it the wrong tool for yielding execution.

Going on, here are the results for a 1 nanosecond sleep:

  • ::std::this_thread::sleep_for: 498 628 ± 263 ns
  • nanosleep: 498 693 ± 353 ns
  • pselect: 498 796 ± 398 ns
  • timerfd recreating: 4 819 ± 182 ns
  • timerfd reusing: 3 273 ± 255 ns
  • timerfd interval: 2 783 ± 163 ns

It seems that ::std::this_thread::sleep_for, nanosleep and pselect are provided by the same underlying mechanism, which is outperformed by about two orders of magnitude by the timerfd API. It can also be noticed that nanosleep seems to treat a 0 ns sleep the same as a 1 ns sleep, unlike the Windows sleep functions that explicitly treat this as a yield only.

There is no real surprise in the relative performance of the timerfd variants themselves: The most general usage case is slowest (although still blazingly fast), with the reuse of the file descriptor saving a lot of work, and the switch to intervals making it faster yet, although it also becomes rather inflexible.
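The three configurations differ only in how much setup is repeated per sleep. The sketch below illustrates this; all function names are invented here, and the use of CLOCK_MONOTONIC is an assumption, since the post does not say which clock the benchmark passes to timerfd_create:

```cpp
#include <sys/timerfd.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>
#include <cstdint>

// Blocks until the timer behind `fd` has expired at least once.
inline bool wait_on(int fd) {
    ::std::uint64_t expirations = 0;
    return ::read(fd, &expirations, sizeof expirations)
        == static_cast<ssize_t>(sizeof expirations);
}

// Arms `fd` to fire after `ns` nanoseconds and then every `interval_ns`
// nanoseconds (0 means one-shot). `ns` must be non-zero.
inline bool arm(int fd, long ns, long interval_ns = 0) {
    itimerspec spec{};
    spec.it_value.tv_nsec = ns;
    spec.it_interval.tv_nsec = interval_ns;
    return ::timerfd_settime(fd, 0, &spec, nullptr) == 0;
}

// Recreating: create, arm, wait on and close a fresh timerfd per sleep.
inline bool sleep_recreating(long ns) {
    int fd = ::timerfd_create(CLOCK_MONOTONIC, 0); // ASSUMPTION: clock choice
    if(fd < 0) return false;
    bool ok = arm(fd, ns) && wait_on(fd);
    ::close(fd);
    return ok;
}

// Reusing: the caller keeps `fd` open and re-arms it for every sleep.
inline bool sleep_reusing(int fd, long ns) {
    return arm(fd, ns) && wait_on(fd);
}

// Interval: the caller arms `fd` once with arm(fd, ns, ns);
// every sleep is then nothing but a read.
inline bool sleep_interval(int fd) {
    return wait_on(fd);
}
```

The reusing variant saves the timerfd_create and close calls, and the interval variant additionally saves timerfd_settime, which matches the ordering of the measurements.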

At this point it should be noted that the actual sleeping on the timerfd is done via read, meaning it is not guaranteed to yield execution, especially in the interval case where the file descriptor may already be ready when read is invoked. Still, for this benchmark, I was able to verify with GNU Time 1.7 that about 3000 context switches do take place during the execution of the timerfd interval variant.
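A similar check can be made from inside the program. Note that this swaps GNU Time for the getrusage call, so it is an alternative technique rather than what was used for this post; the function name is hypothetical:

```cpp
#include <sys/resource.h>

// Number of voluntary context switches of the calling process so far,
// or -1 if getrusage fails.
inline long voluntary_context_switches() {
    rusage usage{};
    if(::getrusage(RUSAGE_SELF, &usage) != 0) return -1;
    return usage.ru_nvcsw;
}
```

Taking the difference of two readings around the benchmark loop gives the number of times the process gave up the CPU while blocking, which for 3000 real sleeps should be in the vicinity of 3000.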

Concluding the Linux analysis: To yield execution, it seems safest to use sched_yield, even though it performs slightly worse than the pselect alternative. To perform short sleeps, the use of timerfd timers is far superior to all other variants, as a timerfd with minimal time returns two orders of magnitude quicker than nanosleep with any time.

OS X

The exact OS X version used for this test was 10.10.5, as El Capitan was not yet available at the time of writing. Be reminded that this test was run on different hardware which must be taken into account when comparing it to the Linux and Windows tests.

The test suite was fairly similar to the Linux one, but the timerfd suite had to be removed as that particular facility is not available on OS X.

This operating system also exhibits a ground truth of less than one nanosecond per iteration (477 ± 27 ns per 3 000).

Beginning with the zero-duration sleeps:

  • ::std::this_thread::sleep_for: 4 ± 1 ns (10 680 ± 670 ns per 3000)
  • nanosleep: 1 086 ± 32 ns
  • pselect: 412 ± 13 ns
  • sched_yield: 180 ± 35 ns

Again, we see a conspicuously low value for ::std::this_thread::sleep_for, suggesting that OS X does not actually perform a sleep here. Maybe the most surprising result is how well both nanosleep and pselect perform compared to sched_yield.

Now the numbers for a 1 nanosecond sleep:

  • ::std::this_thread::sleep_for: 13 809 ± 186 ns
  • nanosleep: 14 831 ± 234 ns
  • pselect: 416 ± 12 ns

For this test, all methods used leave most of those available on the other platforms far in the dust. In fact, only Linux's timerfd facilities manage to compete, and even they are beaten by the OS X pselect by almost an order of magnitude. Additionally, unlike on Linux, great performance is available for all tested methods, including nanosleep, which is after all the obvious choice in C style code, and ::std::this_thread::sleep_for, which is the obvious choice for C++ style code.

Summing up the OS X results, it is obvious that this operating system has all others beat when it comes to short sleeps. While nanosleep performs somewhat worse than pselect, its purpose is more obvious and it can easily be used to continue sleeping in the presence of interrupts.
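Continuing a sleep after an interruption relies on the second argument of nanosleep, which receives the time that was left when a signal arrived. A minimal sketch, with sleep_fully being a name made up for this example:

```cpp
#include <cerrno>
#include <time.h>

// Sleeps for the full duration even if signals interrupt the call.
// Returns false only for errors other than EINTR.
inline bool sleep_fully(timespec request) {
    timespec remaining{};
    while(::nanosleep(&request, &remaining) != 0) {
        if(errno != EINTR) return false; // e.g. EINVAL for a bad timespec
        request = remaining;             // continue with what is left
    }
    return true;
}
```

Doing the same with pselect would require tracking the remaining time by hand, since POSIX does not require pselect to report it.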

Conclusion

Interestingly, the results were mixed for Windows and Linux: Windows 10 seems to bring primitives to the table that perform very well when only yielding execution, but lack resolution when actually sleeping. Linux on the other hand provides the timerfd API, which allows extremely short sleeps when a sleep is actually requested. However, the winner of this article is clearly OS X, handily beating both alternatives in every single category.

The test program, all results and the script used to analyze them can be downloaded here.

Footnotes:

  1. In most cases a blocking wait should be preferred. Would life not be great if we had the pleasure of always being easily able to do things the right way?

  2. The absolute error of the 1 millisecond sleeps is about 1.0 ms, while the 1 nanosecond sleep is off by about 1.5 ms. The relative error differs by roughly 6 orders of magnitude.

  3. sleep has second resolution and usleep is deprecated.

  4. If you are wondering why the heck I am analyzing pselect of all possible functions sporting a timeout, I stumbled over an answer on Stack Overflow that hinted it might be worth evaluating.

  5. Interestingly this is only about ⅔ of the ground truth for Windows 10, possibly due to more aggressive optimization by g++ versus Visual C++.

  6. Setting the time to zero disarms the timerfd completely, meaning that waiting on it will take forever.

  7. I had to run these specific benchmarks significantly more often than the rest to get the confidence intervals small enough to not overlap.

Daniel Schemmel

is currently employed at the Chair of Communication and Distributed Systems at RWTH Aachen University, where he researches the testability of distributed systems. He can be reached at blog(at)gha.st.

Aachen, Germany, Terra, Sol, Milky Way, Laniakea SC https://gha.st/about/