
Debugging race conditions in C/C++

Bugs that are hard to reproduce suck up the time and energy of any software development team. One cause of these bugs is race conditions, which produce erratic and confusing behaviour and make getting a reliable bug report nearly impossible.

But there are ways of solving race conditions, either by following a careful strategy or by adding some useful debugging tools, and either can save a huge amount of time and money. We'll cover both the strategy and the time-saving tools in this post.


What are race conditions?

Race conditions in software, or in any system, occur when the desired output requires that certain events occur in a specific order but the events don’t always happen in that order. There is a ‘race’ between the events and if the wrong events win, the program fails.

For example, if your code works out a discount by applying a fixed discount of $1 and then a 10% discount on top of that, it's important that those discounts are worked out in that order. But if the $1 discount and 10% discount sometimes happen in the reverse order, the discount will sometimes - apparently randomly - be different:

(50 - 1) * 0.9 = 44.1

50 * 0.9 - 1 = 44
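
To make the arithmetic concrete, here is a minimal C++ sketch of the two orderings. The function names are invented for this illustration, and prices are held in integer cents (our assumption, to keep the comparison exact); it is not taken from any real pricing code.

```cpp
#include <cassert>
#include <cstdint>

// Prices are held in integer cents so the comparison is exact.
using Cents = std::int64_t;

// Hypothetical helpers, named for this example only.
Cents apply_fixed_discount(Cents price) { return price - 100; }         // $1 off
Cents apply_percent_discount(Cents price) { return price * 90 / 100; }  // 10% off

// Intended order: the fixed discount first, then the percentage.
Cents fixed_then_percent(Cents price) {
    return apply_percent_discount(apply_fixed_discount(price));
}

// The order you get when the "wrong" event wins the race.
Cents percent_then_fixed(Cents price) {
    return apply_fixed_discount(apply_percent_discount(price));
}
```

For a $50 item, `fixed_then_percent(5000)` gives 4410 cents ($44.10) while `percent_then_fixed(5000)` gives 4400 cents ($44.00): the same inputs and a different answer, purely because of ordering.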

Any complex piece of software is going to make extensive use of threads and other sources of concurrency, such as microservices, and so race conditions become more common as systems become more complex. Race conditions are, by their nature, hard to debug because they cause erratic and apparently random behaviour which can vary across systems and with different inputs.

Which event wins the race may be determined by external data, memory or some other non-deterministic factor, which means that unless you can reproduce the exact input and timing for both cases, you won't get exactly the same output.

Added to this, race conditions don't always cause crashes; they may only change the behaviour of the program in tiny and seemingly innocuous ways, such as a $0.10 difference in a discount, or make the program hang at random. When all a software engineer wants is a good description of the bug, a reliable way of reproducing it and an idea of how to fix it, race conditions are a perfect storm of not enough information or certainty about anything.

A quick skim down Stack Overflow's race conditions tag gives an idea of the breadth of causes and pain race condition bugs bring. Googling or searching forums will bring the same sort of responses.


Is this a race condition?

Firstly, how do you know if some erratic behaviour is caused by a race condition? Well, race conditions cause erratic behaviour, but not all erratic behaviour is caused by race conditions. Until you've found the underlying cause and confirmed that the order of two events is causing the problem, you don't know for sure that it is a race condition. But there are some tells which you can feed into your differential diagnosis.

The bug is likely to be erratic, sometimes fixing itself or behaving differently but apparently at random. This can be enough to drive anyone completely mad.

The behaviour may differ in dev, test and production. Production might have far more data, or dev might be running on more powerful machines. These subtle differences can point to race conditions.

But really, there’s no way of knowing upfront that race conditions are the cause, so you need to dive in and start debugging. There are two approaches: one using traditional debugging tools and the other using record, rewind and replay debugging tools.


How to find and fix race condition bugs in C/C++: The hard way

Your starting point is likely to be the program entering a confusing state or just crashing, which may give you a value, some text or a code reference. From this starting point, you need to extract an idea and form a hypothesis about how the program got into this state.

A typical strategy that you can find suggested on many forums on the web is to add extensive extra logging to the program, effectively printing out almost every variable so you can map what you expect to what's happening.
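
As a rough sketch of what that logging might look like in C++, the helper below tags every entry with the id of the thread that wrote it, so that interleavings show up in the log. `TraceLog` and `run_logged_workers` are hypothetical names made up for this example; a real program would more likely use its existing logging library. Bear in mind that extra logging changes the timing of the program, which can make a race show up less often.

```cpp
#include <cassert>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical in-memory log; each line records which thread wrote it.
class TraceLog {
public:
    void record(const std::string& label, int value) {
        std::ostringstream line;
        line << "thread=" << std::this_thread::get_id() << ' ' << label << '=' << value;
        std::lock_guard<std::mutex> guard(mutex_);  // keep entries from interleaving
        lines_.push_back(line.str());
    }

    std::vector<std::string> snapshot() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return lines_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::string> lines_;
};

// Runs thread_count workers that each log `iterations` values,
// then returns everything that was logged.
std::vector<std::string> run_logged_workers(int thread_count, int iterations) {
    TraceLog log;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&log, iterations] {
            for (int it = 0; it < iterations; ++it) {
                log.record("it", it);
            }
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return log.snapshot();
}
```

The order of the lines returned by `run_logged_workers(2, 50)` can differ from run to run, which is exactly the information you are trying to capture.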

At some point, you'll hit upon a critical data structure which is being affected by the race condition - this is your first smoking gun. The data structure will send the program into an unwanted series of events which result in it crashing or some weird state happening.

If you don't have reversible debugging - which I'll explain in a moment - then you need to use this knowledge of the data structure to find out what other parts of the program are likely to be affecting it.

You next need to find the parts of the program which are racing to change the data structure, hopefully leading you to successfully find and fix the race condition. The name of the data structure and the contents can be useful in finding these parts of the program.

The name of the data structure may be referred to throughout the rest of the program, perhaps leading you to a few places which are racing and independently changing the value. But the contents of the data structure may also help you. For example, in the discount calculation example above, you could look at the rest of the calculation, form an idea about where in the program discounts are calculated, and walk through each of those places. While this is methodical, it’s painstaking, expensive and prone to human error: it is easy to miss the bug in hours of reading over code.

When you have an idea about where the race condition is coming from, it's time to test your theory: put in a breakpoint and step through until you see the race happening, which will be where one part of the code runs before you expect it to or a value isn’t what you expect. If you have an idea of what the right values in the data structure are, you could set a conditional breakpoint, which will speed things up.

Through one of these means - recognising the unexpected data, finding some likely candidates for code that's corrupting data or setting conditional breakpoints - you will find a candidate for the code which is breaking things. However, because this is a race condition you might not be able to replicate the issue easily - so, even with the best information you might still have to resort to running a test many times until you've found the underlying cause.
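
Running a test many times can be automated. The sketch below is one way to do it, under our own assumptions: `run_trial` and `stress` are hypothetical names, and the racy code under test is a stand-in in which two threads increment a shared counter with a separate load and store. The counter is a `std::atomic` so that the behaviour stays well defined, but updates can still be lost because the read and the write are not one step.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Outcome of a stress run.
struct StressResult {
    int trials;        // how many times the scenario was run
    int lost_updates;  // trials whose final count was not the expected one
};

// One trial: two threads each perform "load, then store" increments.
// Another thread may write between the load and the store, losing an update.
int run_trial(int increments_per_thread) {
    std::atomic<int> counter{0};
    auto worker = [&counter, increments_per_thread] {
        for (int i = 0; i < increments_per_thread; ++i) {
            int seen = counter.load(std::memory_order_relaxed);
            counter.store(seen + 1, std::memory_order_relaxed);
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return counter.load();
}

// Repeats the trial and counts how often the race showed itself.
StressResult stress(int trials, int increments_per_thread) {
    StressResult result{trials, 0};
    const int expected = 2 * increments_per_thread;
    for (int t = 0; t < trials; ++t) {
        if (run_trial(increments_per_thread) != expected) {
            ++result.lost_updates;
        }
    }
    return result;
}
```

How many of the trials fail depends on the machine, the load and the scheduler, so a call such as `stress(100, 100000)` may report anything from no failures to all of them. That variability is why this approach can take so long.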


How to find and fix race condition bugs in C/C++: Using reverse debugging

There is another way which involves using reverse debugging. As we’ve written before in our brief history of reverse debugging, otherwise known as time-travel debugging:

“Reverse debugging is the ability of a debugger to stop after a failure in a program has been observed and go back into the history of the execution to uncover the reason for the failure.”

Without a good debugging tool, there's no simple method to attack hard to reproduce bugs and it can quickly descend into a brute force attempt to log ever more values until you can see what's happening in the program. You may have to execute the code multiple times in the hope of seeing the bug, perhaps introducing randomness in the inputs or guessing at what the underlying cause is.

So, for hard to reproduce bugs like race conditions, reverse debugging allows you to start from the point of failure and step back to find the cause. This is a very different and much more satisfying method, which is worth walking through to see the impact reversible debugging has on debugging race conditions.

We’re basing this on an example program you can grab from the end of this post.

First, you don't necessarily even need to identify where the program fails. You start UndoDB, our reversible debugger, pointing it at the program:

./udb race_x64

Start by running the program and see where it goes wrong:

(udb) run
...
s_threadfn: it=110000
s_threadfn: it=70000
race_x64: examples/race.cpp:35: void* s_threadfn(void*): Assertion `s_value == old_value + a' failed.
[New Thread 3441.3452]

Program received signal SIGABRT, Aborted.
[Switching to Thread 3441.3452]
0x00007f9c0d0f2c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56    ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.


In our example, the error is all about the failed assertion `s_value == old_value + a`. We want to know why one thread expects one value but gets another. We then use `reverse-finish` (abbreviated to `rf`) until we arrive at a line in our program:

(udb) rf
89 in abort.c
(udb) rf
92 in assert.c
(udb) rf
101 in assert.c
(udb) rf
35        assert( s_value == old_value + a);


This brings us to where our program aborted, but doesn’t tell us why. We know that when this line executes the value of s_value won’t be equal to old_value + a, so we want to find out what other part of the program altered the value. At this point, we can start to see that it’s a race condition. The few lines leading up to the assert should see `a` be added to `s_value`, but by the time we hit the assert this isn’t the case, so another thread must have changed s_value in the meantime. So, it must be a race condition.

int a = s_random_int( 5);
s_value += a;
assert( s_value == old_value + a);
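
The reasoning above can be replayed deterministically in a single thread. The sketch below is only an illustration of one possible interleaving, not part of the example program: `assertion_would_hold` is a hypothetical name, and the "other thread" is simulated by an ordinary statement placed between the snapshot and the increment.

```cpp
#include <cassert>

// Replays the three lines above, optionally with a write from another
// thread slipped in after old_value has been taken.
// Returns whether the assert's condition would hold.
bool assertion_would_hold(int start, int a, bool other_thread_interferes) {
    int s_value = start;
    int old_value = s_value;  // line 32 of the listing: snapshot of s_value
    if (other_thread_interferes) {
        s_value += 1;         // line 50: the other thread's unlocked write
    }
    s_value += a;             // line 34
    return s_value == old_value + a;  // the condition asserted on line 35
}
```

With no interference the condition holds; with the extra write, s_value ends up one higher than old_value + a and the assert would fire.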

To find out which code changed it, we put a watchpoint on s_value:

(udb) watch s_value

Now we’re paused at the point just before the program aborted and we know that the cause of the problem is s_value. Remember the approach before: wading through the code to find references to s_value. Contrast this with what we do next, which is to run the program in reverse (automatically, not manually) with `reverse-continue` (abbreviated to `rc`), watching for where s_value is changed:

(udb) rc
Continuing.
Hardware watchpoint 1: s_value

Old value = 374200
New value = 374196
0x0000000000400c73 in s_threadfn () at examples/race.cpp:34
34            s_value += a;

udb pauses the program, but the context (printed by `list`, or shown in the TUI window) reveals that this is just the previous line in the same thread, so it isn’t that interesting:

(udb) list
29                std::cout << __FUNCTION__ << ": it=" << it << "\n";
30            }
31            /* Increment <s_value>. Should be safe because we own s_mutex. */
32            int old_value = s_value;
33            int a = s_random_int( 5);
34            s_value += a;
35            assert( s_value == old_value + a);
36            pthread_mutex_unlock( &s_mutex);
37        }
38        return NULL;

...so we `reverse-continue` again:

(udb) reverse-continue
Continuing.
[New Thread 8792.8820]
[Switching to Thread 8792.8820]
Hardware watchpoint 1: s_value


Old value = 236250
New value = 236249
0x0000000000400b86 in s_threadfn2 () at examples/race.cpp:50
50            s_value += 1;  /* Unsafe. */


Now, the "Unsafe" comment isn't going to be in a real system but you get the idea. We have arrived at the line which is causing the race condition:

(udb) list
44          for ( int it=0;; ++it)
45          {
46              if ( it % (100) == 0)
47              {
48                  std::cout << __FUNCTION__ << ": it=" << it << "\n";
49              }
50              s_value += 1;   /* Unsafe. */
51              usleep(10*1000);


And that’s it! We’ve found the offending line.

To recap, that took just five steps:

1. Run the program in UndoDB
2. Watch the program abort
3. `reverse-finish` to get back into the program and discover that there is a race condition on s_value
4. Set a watchpoint on s_value
5. `reverse-continue` until we find the offending line

Of course this is a tiny program, but the principle of reversible debugging holds for larger programs too. Rather than wading through lines of code and guessing at what line is accessing what variable, we find the problem and step back until the cause is found.
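
Once the offending line has been found, one possible fix (a sketch under our own assumptions, not the only option) is to have the second thread take the same mutex before touching s_value. The version below is a cut-down variant of the example program: the loops are finite so that it terminates, the random increment is replaced by a fixed 2, and the names `locked_add_two`, `locked_add_one` and `run_fixed_example` are invented for illustration.

```cpp
#include <cassert>
#include <pthread.h>

static int s_value = 0;
static pthread_mutex_t s_mutex = PTHREAD_MUTEX_INITIALIZER;

// Assumption: finite loops, so that the sketch terminates.
static const int kIterations = 10000;

// Same shape as s_threadfn: update s_value while holding s_mutex.
static void* locked_add_two(void*) {
    for (int it = 0; it < kIterations; ++it) {
        pthread_mutex_lock(&s_mutex);
        int old_value = s_value;
        s_value += 2;
        assert(s_value == old_value + 2);  // holds now that every writer locks
        pthread_mutex_unlock(&s_mutex);
    }
    return NULL;
}

// s_threadfn2 with the fix applied: the increment is now under the mutex.
static void* locked_add_one(void*) {
    for (int it = 0; it < kIterations; ++it) {
        pthread_mutex_lock(&s_mutex);
        s_value += 1;
        pthread_mutex_unlock(&s_mutex);
    }
    return NULL;
}

// Runs both threads to completion and returns the final value.
int run_fixed_example() {
    s_value = 0;
    pthread_t t1;
    pthread_t t2;
    pthread_create(&t1, NULL, locked_add_two, NULL);
    pthread_create(&t2, NULL, locked_add_one, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return s_value;
}
```

Because every write to s_value now happens with s_mutex held, no update can land between taking old_value and checking the assert, and `run_fixed_example()` returns the same total on every run.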

To get started with Undo, try our demo or read more about how reverse debugging dramatically reduces debugging times on hard to reproduce problems.


The code we used

#include <iostream>
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static int
s_random_int( int max)
{
    return rand() / ( RAND_MAX + 1.0) * max;
}

static int  s_value = 0;

static pthread_mutex_t  s_mutex = PTHREAD_MUTEX_INITIALIZER;

static void*
s_threadfn( void*)
{
    for ( int it=0;; ++it)
    {
        pthread_mutex_lock( &s_mutex);
        if ( it % (10*1000) == 0)
        {
            std::cout << __FUNCTION__ << ": it=" << it << "\n";
        }
        /* Increment <s_value>. Should be safe because we own s_mutex. */
        int old_value = s_value;
        int a = s_random_int( 5);
        s_value += a;
        assert( s_value == old_value + a);
        pthread_mutex_unlock( &s_mutex);
    }
    return NULL;
}

static void*
s_threadfn2( void*)
{
    for ( int it=0;; ++it)
    {
        if ( it % (100) == 0)
        {
            std::cout << __FUNCTION__ << ": it=" << it << "\n";
        }
        s_value += 1;   /* Unsafe. */
        usleep(10*1000);
    }
    return NULL;
}

int
main()
{
    pthread_t   t1;
    pthread_t   t2;
    pthread_t   t3;
    pthread_create( &t1, NULL, s_threadfn, NULL);
    pthread_create( &t2, NULL, s_threadfn, NULL);
    pthread_create( &t3, NULL, s_threadfn2, NULL);

    pthread_join( t1, NULL);
    pthread_join( t2, NULL);
    pthread_join( t3, NULL);

    return 0;
}