Debugging Race Conditions in C/C++

Bugs that are hard to reproduce suck up the time and energy of any software development team. One cause of these bugs is race conditions, which produce erratic and confusing behavior and make getting a reliable bug report nearly impossible.

But there are ways of solving race conditions, either by following a careful strategy or by adding some useful debugging tools, and both can save a huge amount of time and money. We'll cover the strategy and the time-saving tools in this post.

What are race conditions?

Race conditions in software, or in any system, occur when the desired output requires that certain events occur in a specific order but the events don’t always happen in that order. There is a ‘race’ between the events and if the wrong event wins, the program fails.

For example, if your code works out a discount by applying a fixed discount of $1 and then a 10% discount on top of that, it's important that those discounts are worked out in that order. But if the $1 discount and 10% discount sometimes happen in the reverse order, the discount will sometimes - apparently randomly - be different:

(50 - 1) * 0.9 = 44.1

50 * 0.9 - 1 = 44

Any complex piece of software is going to make extensive use of threads and other causes of concurrency, such as microservices, and so race conditions become more common as systems become more complex. Race conditions are, by their nature, hard to debug because they cause erratic and apparently random behavior which can vary across systems and with different inputs.

Which event wins the race may be determined by external data, memory or some other non-deterministic factor, which means that unless you can reproduce the exact input and timing, you won't get the same output.

Added to this, race conditions don't just cause crashes. They may change the behavior of the program in small and seemingly innocuous ways, such as a $0.10 difference in a discount or the program sometimes hanging randomly. When all a software engineer wants is a good description of the bug, a reliable way of reproducing it and an idea of how to fix it, race conditions are a perfect storm of not enough information or certainty about anything.

A quick skim down Stack Overflow's race conditions tag gives an idea of the breadth of causes and pain race condition bugs bring. Googling or searching forums will bring the same sort of responses.

Is this a race condition?

Firstly, how do you know if some erratic behavior is caused by a race condition? Well, race conditions cause erratic behavior, but not all erratic behavior is caused by race conditions. Until you've found the underlying cause and confirmed that the order of two events is causing the problem, you don't know for sure that it is a race condition. But there are some tells which you can feed into your differential diagnosis.

The bug is likely to be erratic, sometimes fixing itself or behaving differently but apparently at random. This can be enough to drive anyone completely mad.

The behavior may differ in dev, test and production. Production might have far more data, or dev might be running on more powerful machines. These subtle differences point to race conditions.

But really, there’s no way of knowing upfront that race conditions are the cause, so you need to dive in and start debugging. There are two approaches: one using traditional debugging tools and the other using time travel debugging.

How to find and fix race condition bugs in C/C++ the hard way

Your starting point is likely to be the program entering a confusing state or just crashing, which may give you a value, some text or a code reference. From this starting point, you need to extract an idea and form a hypothesis about how the program got into this state.

A typical strategy that you can find suggested on many forums on the web is to add extensive extra logging to the program, effectively printing out almost every variable so you can map what you expect to what's happening.

At some point, you'll hit upon a critical data structure which is being affected by the race condition - this is your first smoking gun. The data structure will send the program into an unwanted series of events which result in it crashing or some weird state happening.

If you don't have time travel debugging - which I'll explain in a moment - then you need to use this knowledge of the data structure to find out what other parts of the program are likely to be affecting it.

You next need to find the parts of the program which are racing to change the data structure, hopefully leading you to successfully find and fix the race condition. The name of the data structure and the contents can be useful in finding these parts of the program.

The name of the data structure may be referred to throughout the rest of the program, perhaps leading you to find a few places which are racing and independently changing the value. But the contents of the data structure may also help you. For example, in the discount calculation example I gave it could be that you can simply look at the rest of the calculation and form an idea about where in the program discounts are calculated and simply walk through each of them. While this is methodical, it’s painstaking, expensive and prone to human error when you miss the bug in hours of reading over code.

When you have an idea about where the race condition is coming from, it's time to test your theory. Put in a breakpoint and step through until you see the race happening, which will be where one part of the code runs before you expect it to or a value isn’t what you expect. If you know what the right values in the data structure are, you can set a conditional breakpoint, which will speed things up.

Through one of these means - recognizing the unexpected data, finding some likely candidates for code that's corrupting data or setting conditional breakpoints - you will find a candidate for the code which is breaking things. However, because this is a race condition you might not be able to replicate the issue easily - so, even with the best information you might still have to resort to running a test many times until you've found the underlying cause.

How to find and fix race condition bugs in C/C++ using time travel debugging

There is another way which involves using time travel debugging (a.k.a. reverse debugging). 

Reverse debugging is the ability of a debugger to stop after a failure in a program has been observed and go back into the history of the execution to uncover the reason for the failure.

Without a good debugging tool, there's no simple method to attack hard to reproduce bugs and it can quickly descend into a brute force attempt to log ever more values until you can see what's happening in the program. You may have to execute the code multiple times in the hope of seeing the bug, perhaps introducing randomness in the inputs or guessing at what the underlying cause is.

So, for hard to reproduce bugs like race conditions, reverse debugging allows you to start from the point of failure and step back to find the cause. This is a very different and much more satisfying method, which is worth walking through to see the impact time travel debugging has on debugging race conditions.

We’re basing this on an example program you can grab from the end of this post.

First, you don't necessarily even need to identify where the program fails. You start UDB (our time travel debugger), pointing it at the program:

./udb race

Start by running the program and see where it goes wrong:

not running> run
threadfn1: i=20000
threadfn1: i=30000
race: race.cpp:47: void threadfn1(): Assertion `g_value == old_value + a' failed.
[New Thread 108.120]
[New Thread 108.119]
[New Thread 108.121]

Thread 2 received signal SIGABRT, Aborted.
[Switching to Thread 108.120]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	}

In our example, the error is all about the

Assertion `g_value == old_value + a' failed

We want to know why one thread expects one value but gets another. We use reverse-finish to step back out of the C standard library until we arrive at a line in our program:

recording 9,241,049> reverse-finish
0x00007f5b70bf191c in __GI_abort () at abort.c:79
79	      raise (SIGABRT);
99% 9,241,043> reverse-finish
92	  abort ();
99% 9,241,040> reverse-finish
101	  __assert_fail_base (_("%s%s%s:%u: %s%sAssertion `%s' failed.\n%n"),
99% 9,240,636> reverse-finish
0x00005601508016ee in threadfn1 () at race.cpp:47
47	        assert(g_value == old_value + a);

This brings us to where our program aborted, but doesn’t tell us why. We know that when this line executes, the value of g_value isn’t equal to old_value + a, so we want to find out what other part of the program altered the value. At this point, we can start to see that it’s a race condition: the few lines leading up to the assert should add a to g_value, but by the time we hit the assert this isn’t the case, so another thread must have changed g_value in between.

int a = random_int(5);
g_value += a;
assert(g_value == old_value + a);

We find this out by putting a watchpoint on g_value:

99% 9,240,591> watch g_value

Now we’re paused at the point before the program aborted and we know that the cause of the problem is g_value. Remember the approach before - wading through the code to find references to g_value. Contrast this with what we do next which is to run the program in reverse (automatically, not manually) watching for where g_value is changed:

99% 9,240,591> reverse-continue

Thread 2 hit Hardware watchpoint 1: g_value

Old value = 756669
New value = 756668
0x00005601508016bc in threadfn1 () at race.cpp:46
46	        g_value += a;

UDB pauses the program, but looking at the context shows that this is just the line before in the same thread so isn’t that interesting:

99% 9,240,591> list
41	            std::cout << __FUNCTION__ << ": i=" << i << "\n";
42	        }
43	        /* Increment <g_value>. Should be safe because we own g_mutex. */
44	        int old_value = g_value;
45	        int a = random_int(5);
46	        g_value += a;
47	        assert(g_value == old_value + a);
49	        (void)old_value;
50	    }

So we type reverse-continue again:

99% 9,240,591> reverse-continue
[Switching to Thread 218.231]

Thread 4 hit Hardware watchpoint 1: g_value

Old value = 756668
New value = 756667
0x000056015080179e in threadfn2 () at race.cpp:62
62	        g_value += 1; /* Unsafe. */

Now, the "Unsafe" comment isn't going to be in a real system but you get the idea. We have arrived at the line which is causing the race condition:

99% 9,240,581> list
57	    {
58	        if (i % (100) == 0)
59	        {
60	            std::cout << __FUNCTION__ << ": i=" << i << "\n";
61	        }
62	        g_value += 1; /* Unsafe. */
63	        usleep(10 * 1000);
64	    }
65	}

And that’s it! We’ve found the offending line.

To recap, that took just 5 steps:

1. Run the program in UDB
2. The program aborted
3. We type reverse-finish to get back into the program and discover that there is a race condition on g_value
4. We set a watchpoint on g_value and...
5. reverse-continue until we find the offending line

Of course this is a tiny program, but the principle of time travel debugging holds for larger programs. Rather than wading through lines of code and guessing at what line is accessing what variable, we find the problem and step back until the cause is found.

To get started with time travel debugging, try UDB or read more about how time travel debugging (also called reverse debugging) dramatically reduces debugging times on hard to reproduce problems.

The code we used

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#include <iostream>
#include <mutex>
#include <random>
#include <thread>

/*
 * Returns a random integer in the range [0, max]
 */
static int
random_int(int max)
{
    static std::random_device rd;
    static std::mt19937 generator(rd());
    static std::uniform_int_distribution<int> distribution(0, max);

    return distribution(generator);
}

static int g_value = 0;

static std::mutex g_mutex;

static void
threadfn1()
{
    for (int i = 0;; ++i)
    {
        /* Take ownership of g_mutex for the duration of this scoped block. */
        const std::lock_guard<std::mutex> lock(g_mutex);

        if (i % (10 * 1000) == 0)
        {
            std::cout << __FUNCTION__ << ": i=" << i << "\n";
        }
        /* Increment <g_value>. Should be safe because we own g_mutex. */
        int old_value = g_value;
        int a = random_int(5);
        g_value += a;
        assert(g_value == old_value + a);
        (void)old_value;
    }
}

static void
threadfn2()
{
    for (int i = 0;; ++i)
    {
        if (i % (100) == 0)
        {
            std::cout << __FUNCTION__ << ": i=" << i << "\n";
        }
        g_value += 1; /* Unsafe. */
        usleep(10 * 1000);
    }
}

int
main()
{
    std::thread t1(threadfn1);
    std::thread t2(threadfn1);
    std::thread t3(threadfn2);

    /* The threads loop forever; the program ends when the assert fires. */
    t1.join();
    t2.join();
    t3.join();

    return EXIT_SUCCESS;
}

Learn more about UDB's reverse debugging capabilities.
