How to Debug Linux C++ Race Conditions

Debugging C++ race conditions is hard. Super hard. There are no silver bullets, but here is a process that you can follow to get to the bottom of it. The following steps are somewhat iterative – you may want to cycle through them in more and more depth until finally the answer reveals itself, and also there is a certain amount of overlap between the steps.

A Step by Step Guide

Step 0: Use the tools

There are some great tools out there to help you: Helgrind and ThreadSanitizer are both really good at detecting certain kinds of races. I know I’m biased, but time travel debugging is super useful too, whether with rr or Undo’s LiveRecorder.

You should use these tools where you can, but sometimes practical considerations mean either you can’t deploy the tools, or they don’t pinpoint the problem. For Helgrind and ThreadSanitizer, this usually means they just don’t detect the race (see below for a discussion of where they work and where they don’t).

With time travel debuggers, if you can capture the race, it’s nearly always straightforward to then find the root cause of the problem. But sometimes you just cannot capture the race when recording: rr’s Chaos Mode and LiveRecorder’s Thread Fuzzing both help a lot, but neither is perfect. Or maybe it’s just that rr won’t run on your system, and you don’t have a license for LiveRecorder.

So, you’ve tried the tools at your disposal, but to no avail. What next?

Step 1: Figure out how to reproduce at will

We’re going to have to run a series of experiments as we home in on the root cause. On a good day, we have a test case that fails relatively quickly, but if not, we are going to need some way of catching it in the act so that we can run and rerun and gather more clues. If that means lashing together complicated scripting and running the thing for hours, so be it.
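To make "run and rerun" concrete, here is a minimal sketch of a rerun harness in C++. The name first_failing_run and its return convention are made up for illustration, and it assumes the calling process is single-threaded when it forks. Each attempt runs in a fresh child process, so a crash or failed assertion in one run cannot take the harness down with it.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical harness: runs `attempt` in a fresh child process up to
// max_runs times. Returns the 1-based number of the first failing run,
// 0 if every run passed, or -1 if fork()/waitpid() itself failed.
// A run "fails" if the child exits non-zero or is killed by a signal
// (for example SIGABRT from a failed assert, or SIGSEGV).
inline long first_failing_run(const std::function<int()>& attempt, long max_runs) {
    assert(max_runs >= 0);
    for (long run = 1; run <= max_runs; ++run) {
        const pid_t pid = ::fork();
        if (pid < 0) return -1;
        if (pid == 0) ::_exit(attempt());  // child: run one attempt, then exit
        int status = 0;
        pid_t waited;
        do {
            waited = ::waitpid(pid, &status, 0);
        } while (waited < 0 && errno == EINTR);  // retry if interrupted by a signal
        if (waited < 0) return -1;
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return run;
    }
    return 0;
}
```

Knowing that the failure shows up on, say, run 3,000 out of 10,000 also gives you a rough failure rate, which is a handy baseline for judging whether later experiments make the bug more or less likely.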

In the worst-case scenario, we can find no way of reproducing other than “in production”. In such a case, we’ll have to run our experiments in production.

See the next step for some tips on how to make bugs more likely to reproduce.

Step 2: Characterize the bug

First off, are you sure it’s even a race condition? Programmers are sometimes too hasty in blaming any non-deterministic or sporadic failure on a race, when it might be due to some other non-deterministic factor such as memory layout, input, etc.

In practice, until they’re root caused, C++ race conditions are a “diagnosis by exclusion”. So first, let’s check for obvious other causes:

  • Run with AddressSanitizer and/or Valgrind (specifically Valgrind’s memcheck tool, which is its default). Does that show anything? Maybe the assumed data race was actually a buffer overrun. Check out my quick intro to using AddressSanitizer and Valgrind if you’re not familiar with these open-source tools.
  • Disable address space layout randomisation (ASLR) – does the ‘race’ disappear? If so, that doesn’t prove it isn’t a race, but it probably isn’t. The easiest way to do this is to start a shell with randomisation turned off and run your reproducer from it: setarch --verbose --addr-no-randomize /bin/bash (older versions of setarch also need the architecture as the first argument, e.g. setarch x86_64 --addr-no-randomize /bin/bash).
  • Make sure every rerun of the failure is given the exact same inputs and configuration. If you can, consider running in a container to minimize system non-determinism.

I remember once having a test that only ever failed in the overnight suite. If run by hand, it was impossible to reproduce. We were sure it was some super subtle race that only happened if the system was under just the right load. It turned out to be a pesky regex parsing bug, which was parsing the time of the day, and only failed when the hour was a single digit, and since programmers tend not to do much before 10am, it never reproduced when run locally!

If we still suspect a race condition, let’s play with timings to see if we can make it more or less likely:

  • Try running the workload pinned to a single CPU core by prefixing your reproducer with taskset -c 1. For example, if your reproducer is make test_threads, run taskset -c 1 make test_threads. If this significantly changes the frequency of the bug, that implies it is indeed a race condition; if it makes the bug go away, that implies we’re looking for a narrow window. If the bug still reproduces on a single CPU and we believe it really is a race condition, then most likely it is some unsafe state that persists for a long time.
  • Try forcing more scheduling by running the stress-ng command somewhere else concurrently with your reproducer.
  • Try slowing down the component you suspect of having a race (either certain threads, or in a multi-process system certain processes). This is effectively the same thing as speeding up the other threads/processes and can make races more likely. You could de-prioritize them by applying a high nice value (e.g. renice -n 19 -p [pid] for a process that is already running, or nice -n 19 [command] when starting it), or just hack in some sleeps or busy loops at targeted places in the code. If you find that slowing down one component makes the race more likely to bite, then this probably means that component is at least half of the story – some other component is “doing something” to it too fast.
  • Try the TARDIS library. This is a super cool little utility that will cause time-stretching on sleep and gettimeofday() type operations.
  • Try simplifying your test case: if the suspected race can be narrowed down to a particular component, then removing other components will reduce the sources of timing perturbation, making each run more similar – hopefully in a way that still reproduces the race.

Note that these changes may make the race more likely to bite, or less likely, or they may have no effect. Whichever, it gives you a clue about what is going wrong. Obviously there is a lot of overlap here with step 1 – if you can find a way to make the race more likely to bite, then it will become easier to investigate as we can run more experiments.
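For the "hack in some sleeps" approach, a small helper makes it easier to switch the delays on and off between runs without rebuilding. This is only a sketch: perturb(), perturb_max_us() and the RACE_PERTURB_US environment variable are names invented here, not part of any library.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <random>
#include <thread>

// Maximum delay in microseconds, read once from RACE_PERTURB_US
// (an environment variable name invented for this sketch). 0 = disabled.
inline long perturb_max_us() {
    static const long max_us = [] {
        const char* s = std::getenv("RACE_PERTURB_US");
        const long v = s ? std::strtol(s, nullptr, 10) : 0L;
        return v > 0 ? v : 0L;
    }();
    return max_us;
}

// Call at points you suspect are involved in the race. Sleeps for a
// random time in [0, max] microseconds; does nothing when disabled.
inline void perturb() {
    const long max_us = perturb_max_us();
    if (max_us == 0) return;
    // One generator per thread, so the helper shares no state between threads.
    thread_local std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<long> dist(0, max_us);
    const long delay = dist(rng);
    assert(delay >= 0 && delay <= max_us);
    std::this_thread::sleep_for(std::chrono::microseconds(delay));
}
```

Drop perturb() calls into the component you suspect, then compare failure rates with the variable unset and with, say, RACE_PERTURB_US=500. If the failure rate moves, that call site is probably close to the window you are looking for.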

Step 3: Determine the sequence of events

As with any bug, a race condition is a sequence of events we were not anticipating. An assumption is something you don’t know you’ve made, so we’re looking for evidence that exposes something we didn’t think was possible or didn’t think would be a problem.

  • It might not be glamorous, but our first port of call is probably to add some printf/log statements. However, it’s common for adding print statements to make race conditions go away (sometimes referred to as Heisenbugs). This is not simply because printf or an equivalent can itself be quite slow and so affect timings, but also because it will usually introduce a synchronization point – e.g. libc’s printf will take an internal mutex. To overcome this you could implement a simple in-memory logging system.
  • Add assertions liberally. You might add specific assertions that aren’t suitable for committing as they won’t always hold, but perhaps in this test-case you know that this count variable should always be between 0 and 9. Sprinkling such assertions helps you narrow down the culprit. You may find the nature of the failure starts changing as you add the assertions – sometimes it happens before line 100, other times afterwards. Again, this is telling you something about the nature of the problem.
  • Crack open the debugger. This is almost a heretical statement in some circles, but the debugger (GDB, LLDB, whatever) can help you understand what is happening, and often with far less perturbation on the program than adding print statements (at least until you hit a breakpoint, obviously!). If your code is race free, you should be able to breakpoint and/or single-step through one thread while the other threads run without anything breaking (except for timeouts – it might be worth changing your code to disable or stretch timeouts). Related to step 1, if the bug really does disappear while running under the debugger, this is a clue that it really is a race with quite a narrow window. Most race conditions will reproduce just fine with a debugger attached. Contrary to many people’s understanding, GDB and LLDB will not affect the timing of a multithreaded program other than when:
    • Dynamic shared libraries are loaded or unloaded
    • Threads are created or destroyed
    • Signals are received
    • Breakpoints, watchpoints or catchpoints are hit

If none of the above are happening, then there will be almost no discernible difference when running under the control of a debugger on Linux.
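The simple in-memory logging system suggested in the first bullet above could look something like the following sketch. All the names (TraceEntry, trace, trace_dump, kTraceSize) are hypothetical. It is deliberately crude: if the buffer wraps while a slow writer is still filling its slot, that entry can come out torn, which is usually an acceptable trade for a debugging aid. The atomic increment still perturbs timing a little, but far less than printf does.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical in-memory trace buffer; names and sizes are illustrative.
struct TraceEntry {
    const char* tag;      // must point at a string literal (it is not copied)
    std::uint64_t value;  // any number worth recording
    std::uint64_t seq;    // global order in which the slot was claimed
};

constexpr std::size_t kTraceSize = 4096;
static TraceEntry g_trace[kTraceSize];
static std::atomic<std::uint64_t> g_trace_next{0};

// Record one event. The only shared operation is a relaxed fetch_add,
// so this takes no lock and makes no system call.
inline void trace(const char* tag, std::uint64_t value) {
    assert(tag != nullptr);
    const std::uint64_t seq = g_trace_next.fetch_add(1, std::memory_order_relaxed);
    TraceEntry& e = g_trace[seq % kTraceSize];
    e.tag = tag;
    e.value = value;
    e.seq = seq;
}

// Print the surviving entries, oldest first. Call this once the other
// threads have stopped, or inspect g_trace from the debugger or a core file.
inline void trace_dump(std::FILE* out) {
    const std::uint64_t end = g_trace_next.load();
    const std::uint64_t begin = end > kTraceSize ? end - kTraceSize : 0;
    for (std::uint64_t s = begin; s < end; ++s) {
        const TraceEntry& e = g_trace[s % kTraceSize];
        std::fprintf(out, "%llu %s %llu\n",
                     static_cast<unsigned long long>(e.seq), e.tag,
                     static_cast<unsigned long long>(e.value));
    }
}
```

A call such as trace("queue_pop", item_id) at each interesting point (where item_id is whatever value you care about) gives you a globally ordered record of what the threads did just before the failure.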

Other Thoughts on Debugging Linux C++ Race Conditions

Most races are not actually data races

The canonical race condition example given in almost every online tutorial, book, or university lecture, looks something like this:

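// needs <cassert> and <thread>
int count = 0;
std::thread incr([&count]() {for (int i = 0; i < 10; i++) count++;});
std::thread decr([&count]() {for (int i = 0; i < 10; i++) count--;});
incr.join();  // wait for both threads; destroying a joinable std::thread calls std::terminate
decr.join();
assert(count == 0);  // can fail: unsynchronized updates may be lost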

The increment and decrement on count are non-atomic read-modify-write operations, so the updates are liable to be lost because we get a read-read-modify-modify-write-write sequence – i.e. the first write is overwritten by the second. This is an example of a data race. Easily fixed: just put a mutex around the updates, or make count a std::atomic<int> (or use a compiler builtin such as GCC’s __atomic_fetch_add()). This kind of race is easily identified too: just run it through Helgrind or ThreadSanitizer.
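As a minimal sketch of the atomic version of that fix (the function name run_counter_demo is made up for illustration), the same workload with std::atomic<int> looks like this:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Same workload as the canonical example, but with an atomic counter so
// each increment and decrement is an indivisible read-modify-write.
inline int run_counter_demo() {
    std::atomic<int> count{0};
    std::thread incr([&count] { for (int i = 0; i < 10; i++) count.fetch_add(1); });
    std::thread decr([&count] { for (int i = 0; i < 10; i++) count.fetch_sub(1); });
    incr.join();  // wait for both threads before reading the result
    decr.join();
    const int result = count.load();
    assert(result == 0);  // no update can be lost now
    return result;
}
```

Building both versions with -fsanitize=thread is a quick way to check that ThreadSanitizer reports the plain int version and stays quiet for this one.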

However, in my experience, most races are not like this: they’re much more likely to involve the operating system in some way. For example, a signal arriving at an inopportune moment, or a subprocess that almost always completes in time before its result is checked, or a read from the filesystem returning short, or even user input happening too quickly. Unfortunately, dedicated race detection tools like Helgrind, DRD and ThreadSanitizer won’t help here.
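To illustrate one of these, take the short read. Code that calls read() once and assumes it got everything works almost every time, then fails when the data arrives in two pieces or a signal interrupts the call. A sketch of the usual defensive loop follows; read_full is a hypothetical helper name, not a standard function.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Keep reading until `len` bytes have arrived, end-of-file is reached,
// or a real error occurs. Returns the number of bytes read (less than
// `len` only at end-of-file), or -1 on error with errno set by read().
inline ssize_t read_full(int fd, void* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    std::size_t done = 0;
    while (done < len) {
        const ssize_t n = ::read(fd, static_cast<char*>(buf) + done, len - done);
        if (n > 0) {
            done += static_cast<std::size_t>(n);  // short read: go round again
        } else if (n == 0) {
            break;                                // end-of-file
        } else if (errno == EINTR) {
            continue;                             // interrupted by a signal: retry
        } else {
            return -1;
        }
    }
    return static_cast<ssize_t>(done);
}
```

None of the threads in a program like this touch shared memory unsafely, which is why a data race detector has nothing to report.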

Most races are not that subtle

(Related to the above.) Sometimes race conditions really are some nasty, super narrow window like a missing memory barrier, or a time-of-check to time-of-use (TOCTOU) race.

Usually though, when you finally get to the root cause, it’s some glaring error and you slap your forehead and ask how that didn’t fail more often. Often the race is hiding in plain sight.
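For the narrow-window kind, here is a sketch of what a time-of-check to time-of-use race can look like with files, along with one common way to close the window. Both function names are hypothetical, and the "regular file" check is just an example of a property you might want to verify.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Racy: the path can be removed or replaced between stat() and open(),
// so the file we open is not necessarily the one we checked.
inline int open_regular_racy(const char* path) {
    assert(path != nullptr);
    struct stat st;
    if (::stat(path, &st) != 0 || !S_ISREG(st.st_mode)) return -1;
    return ::open(path, O_RDONLY | O_CLOEXEC);
}

// Safer: open first, then check the object we actually hold with fstat().
// Returns an open file descriptor, or -1 if the path cannot be opened or
// is not a regular file.
inline int open_regular(const char* path) {
    assert(path != nullptr);
    const int fd = ::open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0) return -1;
    struct stat st;
    if (::fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        ::close(fd);
        return -1;
    }
    return fd;
}
```

The racy version will pass every test you throw at it unless something else happens to touch the path in the gap between the two system calls.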

Never synchronize using a sleep

I have too often seen** code sleep for, say, 1 second to allow another thread or process or the operating system “plenty of time” to do what it needs, before picking up the result or triggering some action. But one time in a million, that 1 second sleep isn’t enough, so someone bumps it to 5s. Then 10s. The tests slow to a crawl, and it’s still never enough: some day the system is under extreme load, or perhaps NTP has decided the clock needs to run faster, whatever. You really should use proper synchronization primitives, but if you can’t, check whether the result is ready and retry if it’s not. Fortunately, ThreadSanitizer will detect some uses of sleep as synchronization and warn you. Heed such warnings!

[** OK, I admit it, I’ve done it myself more than a few times. I always lived to regret it.]
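As a sketch of what a proper synchronization primitive looks like in place of the sleep, here is a one-shot result handed from one thread to another with a mutex and condition variable. The names Result, publish, wait_for_result and sleepless_demo are invented for this example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot result slot shared between two threads.
struct Result {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;
    int value = 0;
};

// Producer side: store the value under the lock, then wake any waiter.
inline void publish(Result& r, int v) {
    {
        std::lock_guard<std::mutex> lock(r.m);
        r.value = v;
        r.ready = true;
    }
    r.cv.notify_all();
}

// Consumer side: block until the value really is ready, however long
// that takes. The predicate also copes with spurious wakeups.
inline int wait_for_result(Result& r) {
    std::unique_lock<std::mutex> lock(r.m);
    r.cv.wait(lock, [&r] { return r.ready; });
    return r.value;
}

inline int sleepless_demo() {
    Result r;
    std::thread producer([&r] { publish(r, 42); });
    const int v = wait_for_result(r);
    producer.join();
    assert(v == 42);
    return v;
}
```

The consumer waits exactly as long as it needs to: no time is wasted when the producer is quick, and nothing breaks when the machine is heavily loaded and the producer is slow.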

It’s almost never a coincidence

This is more of a general debugging tip, but if you see something weird but not necessarily wrong, it’s often tempting to write it off as “well, I guess that could happen, it’s probably not related”. Occasionally, it is indeed a coincidence, a red herring. But 9 times out of 10, if something seems like a coincidence, it’s probably telling you something important: you’ve just misinterpreted what you’re seeing, or there’s an assumption that you didn’t know you’d made.

Don’t ignore the smell of smoke

If you were in your home and you smelled smoke, you’d investigate to find the root cause; you wouldn’t ignore it because you were too busy. Likewise, if you’re debugging some code and you encounter something weird, don’t ignore it, and don’t just stuff it in a TODO comment and carry on down your current line of investigation. Allow yourself to be diverted. When you see something wrong but “obviously” unrelated to what you’re currently looking into, it’s tempting to put it to one side because it feels like a distraction. But I find at least half of the time this thing is directly relevant to my current mission. And even if it really is a side track, it’s a ticking time bomb waiting for some other poor soul (quite likely future me) to have to redo all the work I’ve just done to get to this point.

Explore depth-first. Sometimes the constraints of real life mean it just isn’t possible to do this: sometimes we really are too busy bailing water to fix the holes. But ask yourself, can I really not spend just 20 minutes going down this rabbit hole?

Conclusion: Pick the Right Tool for the Job to Quickly Get to the Root Cause of Race Conditions

C++ race conditions on Linux (and in computing systems in general) are among the most challenging issues to debug for several reasons:

  • Non-deterministic nature: Race conditions depend on the timing and sequence of events, making their occurrence unpredictable and non-deterministic. They tend not to manifest consistently, making it challenging to replicate the issue.
  • Complexity in identification: These issues arise in complex, multi-threaded, or multi-process environments where multiple components interact. Pinpointing the exact cause of the race condition within the code can be intricate due to the interaction of numerous elements.
  • Difficulty in reproduction: As race conditions depend on specific timing and concurrency, replicating the exact conditions under which the issue occurred can be very challenging. It may require specific circumstances that are hard to simulate, making it difficult to reproduce the problem for debugging purposes.
  • Debugging tools limitations: Traditional debugging tools may not always effectively capture or detect race conditions. Monitoring tools might not reveal the subtle timing issues that lead to race conditions, making it harder to identify and isolate the problem.
  • Concurrency complexity: When multiple threads or processes share resources concurrently without proper synchronization, tracing the flow of data and understanding the interactions becomes complex. This complexity adds to the difficulty of debugging race conditions.

Overall, the combination of these factors (non-deterministic behavior, complexity in identification, difficulty in reproduction, limitations of debugging tools, and concurrency complexity) makes race conditions on Linux particularly challenging and time-consuming to debug.

Unfortunately, there are no silver bullets – it’s just a hard slog. Fortunately, there are things you can do to make it less awful. There are tools out there that, at least some of the time, will help you get to the root cause a lot more quickly.

