Expose Concurrency Bugs With Thread Fuzzing

Wikipedia defines fuzz testing as "an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program."

As programs get larger and more complex, they often need to process data or execute instructions concurrently, which can lead to non-deterministic defects that arise when the execution of one thread interferes with another.

LiveRecorder comes with a capability called Thread Fuzzing, which manipulates the way that threads are scheduled. This makes concurrency bugs that are rare under normal conditions statistically more common, so they can be found before they reach customers.

Several LiveRecorder users have enabled Thread Fuzzing in unit tests to see if any new bugs were exposed, and 100% of them reported that thread fuzzing revealed hitherto undiagnosed race conditions. Because the failures were recorded, they were significantly easier (and faster!) to fix.

We had written a unit test that stress-tested a newly added piece of concurrent code and there was some issue we couldn’t understand. We couldn’t reproduce the issue at first, but using LiveRecorder with thread-fuzzing turned on, we were able to reproduce the issue within minutes.

Andreas Erz, Software Developer at SAP

 

This article explains how to use Thread Fuzzing in LiveRecorder to improve software quality.

Common Concurrency Bugs

Non-deterministic factors like thread switching and external data can affect the order or timing of thread execution, resulting in concurrency issues that cause unpredictable application behavior such as miscalculations, crashes or hangs.

Below are some examples of common concurrency defects.

Atomicity Violation

If the execution of Thread 1 below is interrupted by Thread 2 immediately after evaluating the branch condition, then Thread 1 could crash with a memory access violation.

Thread 1:

if (foo->bar)
{
    do_something(foo->bar);
}

Thread 2:
foo->bar = NULL;
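
One common way to close this window is to make the check and the use a single critical section. The following is a minimal sketch, assuming a hypothetical `struct foo` that carries its own mutex; `use_bar()` and `clear_bar()` are illustrative names, and `strlen()` merely stands in for `do_something()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the article's `foo`: the mutex guards `bar`. */
struct foo {
    pthread_mutex_t lock;
    const char *bar;
};

/* Thread 1: check and use `bar` while holding the lock, so no other
 * thread can clear it between the test and the dereference.
 * Returns the string length, or 0 if `bar` was NULL. */
size_t use_bar(struct foo *foo)
{
    size_t len = 0;

    assert(foo != NULL);
    pthread_mutex_lock(&foo->lock);
    if (foo->bar) {
        len = strlen(foo->bar);  /* stands in for do_something(foo->bar) */
    }
    pthread_mutex_unlock(&foo->lock);
    return len;
}

/* Thread 2: clear `bar` under the same lock. */
void clear_bar(struct foo *foo)
{
    assert(foo != NULL);
    pthread_mutex_lock(&foo->lock);
    foo->bar = NULL;
    pthread_mutex_unlock(&foo->lock);
}
```

With both threads going through these helpers, Thread 2 can only run before the check or after the use, never in between.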

Deadlock

It is possible for Thread 1 to wait indefinitely for Thread 2 to unlock L2, while Thread 2 is waiting for Thread 1 to release L1. Pared back to its simplest case:

Thread 1:

pthread_mutex_lock(L1); 
pthread_mutex_lock(L2);

Thread 2:

pthread_mutex_lock(L2); 
pthread_mutex_lock(L1);
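
A standard remedy is to make every thread take the locks in the same global order. The sketch below orders the two mutexes by address; `lock_pair()`, `unlock_pair()` and `run_lock_order_demo()` are hypothetical helpers written for this illustration, not part of pthreads or LiveRecorder.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Lock two distinct mutexes in a fixed global order (lowest address
 * first), regardless of the order in which the caller names them. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    assert(a != b);  /* locking one default mutex twice would self-deadlock */
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    pthread_mutex_unlock(b);
}

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;  /* guarded by L1 and L2 together */

static void *thread1(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock_pair(&L1, &L2);
        shared_total++;
        unlock_pair(&L1, &L2);
    }
    return NULL;
}

/* Names the locks in the opposite order, as Thread 2 does above. */
static void *thread2(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock_pair(&L2, &L1);
        shared_total++;
        unlock_pair(&L2, &L1);
    }
    return NULL;
}

/* Runs both threads to completion and returns the final total. */
long run_lock_order_demo(void)
{
    pthread_t t1, t2;

    shared_total = 0;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_total;
}
```

Because both threads end up acquiring the mutexes in the same order, neither can hold one lock while waiting for the other thread to release the second.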

Race Condition

A race condition is a type of software defect that occurs when separate threads interact in an unforeseen way and disrupt the expected timing and ordering of operations.

For example, two threads may try to change shared data at the same time, leading to unpredictable system behavior. The threads are in “a race”, and which one wins depends on non-deterministic events.
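
As an illustration, consider two threads incrementing a shared counter. The sketch below contrasts a racy worker with one that uses C11 atomics; all names (`racy_worker`, `safe_worker`, `run_two`, `run_safe_counter`) are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define INCREMENTS 100000

static long racy_counter;         /* plain variable: concurrent updates can be lost */
static atomic_long safe_counter;  /* C11 atomic: each update is indivisible */

/* Racy: ++ is a load, an add and a store, and another thread can run
 * between those steps and overwrite the result. */
void *racy_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        racy_counter++;
    return NULL;
}

/* Safe: the read-modify-write happens as one atomic operation. */
void *safe_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&safe_counter, 1);
    return NULL;
}

/* Run two copies of `worker` concurrently and wait for both to finish. */
void run_two(void *(*worker)(void *))
{
    pthread_t a, b;

    assert(worker != NULL);
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
}

/* Returns the final count after two safe workers have run. */
long run_safe_counter(void)
{
    atomic_store(&safe_counter, 0);
    run_two(safe_worker);
    return atomic_load(&safe_counter);
}
```

Calling `run_two(racy_worker)` instead may leave `racy_counter` short of the expected total, and by an amount that varies from run to run, which is exactly what makes such defects hard to reproduce.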

Thread Fuzzing in LiveRecorder

Issues like these are difficult and time-consuming to recreate and investigate. Thread Fuzzing can help capture them before they are released into production.

In LiveRecorder, the Thread Fuzzing feature manipulates the way that threads are scheduled. A number of different Thread Fuzzing modes are available and you can configure one or more of these modes to be used when recording your application:

  • Thread Starvation (UNDO_tf=starve)
  • Randomising thread slices (UNDO_tf=random)
  • Switching inside basic blocks (UNDO_tf=in-bb)
  • Switches around locking/syncing instructions (UNDO_tf=sync-instr)

Thread Starvation

A common type of concurrency bug is due to ordering problems, for instance when there’s a fast data-generating thread and a slower second thread consuming that data. The consumer thread, being slower, tends to always have data to consume, so noticing bugs is rare. However, if the consumer thread overtakes the generator thread, for instance due to slow I/O, an error might occur.

char* array[100] = {0};

void generator_thread() {
    for (int i = 0; i < 100; i++) {
        array[i] = strdup("Hello world\n");
    }
}

void consumer_thread() {
    for (int i = 0; i < 100; i++) {
        // Error: the consumer can overtake the generator
        // and call puts() on NULL!
        puts(array[i]);
    }
}

Thread Fuzzing’s starve mode encourages race conditions by randomly picking some threads and preventing them from making progress for a short period of time.
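
Once starvation has exposed an ordering bug like the one above, one possible fix is to have the consumer wait until each slot has actually been published. The following is a sketch under that assumption, using a pthread condition variable; the `published` counter and `run_generator_consumer()` are hypothetical additions, and a string literal replaces `strdup()` to keep the example short.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define ITEMS 100

static const char *array[ITEMS];
static int published;  /* number of slots filled so far, guarded by lock */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;

static void *generator_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&lock);
        array[i] = "Hello world";
        published = i + 1;            /* slot i is now safe to read */
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer_thread(void *arg)
{
    int *consumed = arg;

    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (published <= i)        /* block instead of overtaking */
            pthread_cond_wait(&ready, &lock);
        const char *item = array[i];
        pthread_mutex_unlock(&lock);

        assert(item != NULL);         /* holds even if the generator is starved */
        (*consumed)++;
    }
    return NULL;
}

/* Starts the consumer first on purpose; returns how many items it consumed. */
int run_generator_consumer(void)
{
    pthread_t gen, con;
    int consumed = 0;

    published = 0;
    pthread_create(&con, NULL, consumer_thread, &consumed);
    pthread_create(&gen, NULL, generator_thread, NULL);
    pthread_join(gen, NULL);
    pthread_join(con, NULL);
    return consumed;
}
```

With this change, delaying the generator only makes the consumer wait longer; it can no longer read a slot that has not been filled.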

Randomizing Thread Slices

Normally, LiveRecorder lets a thread run for a fixed number of basic blocks before letting the kernel switch to another thread. The random mode makes the length of these runs random and often much shorter to increase the number of thread switches. This increases the chances of these threads interrupting each other.

Switching Inside Basic Blocks

A basic block is a code sequence with one entry point (nowhere inside is the destination of a jump instruction elsewhere) and one exit point (only the last instruction can cause the program to begin executing in a different block). By default, LiveRecorder doesn’t allow basic blocks to be interrupted: thread switches can only happen at basic block boundaries, where the code branches.

This may hide bugs caused by state that is inconsistent for only a very short time, for instance:

volatile int value1 = 0;
volatile int value2 = 0;

void setter_thread() {
    for (int i = 0; i < 100; i++) {
        value1 = i;
        value2 = i;
    }
}

void checker_thread() {
    for (int i = 0; i < 100; i++) {
        assert(value1 == value2);
    }
}

The above code would never fail in LiveRecorder with default settings as the setter thread can never be interrupted between the two assignments. The in-bb Thread Fuzzing mode allows thread switches to happen anywhere, making it possible to reproduce this kind of bug.
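
Outside LiveRecorder, or with in-bb enabled, the checker can observe the two values mid-update. A possible fix, sketched here with a hypothetical `pair_lock` mutex and `run_pair_check()` driver, is to treat the two assignments and the comparison as critical sections on the same lock.

```c
#include <assert.h>
#include <pthread.h>

#define ROUNDS 100000

static int value1, value2;  /* must always change together */
static pthread_mutex_t pair_lock = PTHREAD_MUTEX_INITIALIZER;

static void *setter_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&pair_lock);
        value1 = i;
        value2 = i;   /* a switch between the two stores is now harmless */
        pthread_mutex_unlock(&pair_lock);
    }
    return NULL;
}

static void *checker_thread(void *arg)
{
    long *checks = arg;

    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&pair_lock);
        assert(value1 == value2);  /* never sees a half-finished update */
        pthread_mutex_unlock(&pair_lock);
        (*checks)++;
    }
    return NULL;
}

/* Runs both threads and returns the number of checks that passed. */
long run_pair_check(void)
{
    pthread_t setter, checker;
    long checks = 0;

    pthread_create(&setter, NULL, setter_thread, NULL);
    pthread_create(&checker, NULL, checker_thread, &checks);
    pthread_join(setter, NULL);
    pthread_join(checker, NULL);
    return checks;
}
```

The lock makes the pair of stores appear indivisible to the checker, no matter where the scheduler chooses to switch threads.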

Switches Around Locking/Syncing Instructions

Basic locking primitives and atomic operations, for instance gcc’s __atomic_* functions or pthread mutexes, are generally implemented using machine instructions that are used mainly for this purpose (for instance, Intel’s cmpxchg instruction).

By allowing extra thread switches around these specific instructions we can make it more likely that another thread will be run at this point, potentially exposing concurrency bugs.
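
To make this concrete, here is a small sketch of code built on such an instruction. It uses the C11 `atomic_compare_exchange_strong()` call, which on x86 is typically compiled to a `lock cmpxchg`; the `try_claim()` and `run_claim_demo()` functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREADS 4

static atomic_int owner;    /* 0 means unclaimed; otherwise the winner's id */
static atomic_int winners;  /* how many threads believe they won */

/* Atomically claim ownership if nobody has yet.
 * Returns nonzero only for the single caller that swaps 0 -> id. */
int try_claim(int id)
{
    int expected = 0;

    assert(id != 0);  /* 0 is reserved for "unclaimed" */
    return atomic_compare_exchange_strong(&owner, &expected, id);
}

static void *claimer(void *arg)
{
    int id = *(int *)arg;

    if (try_claim(id))
        atomic_fetch_add(&winners, 1);
    return NULL;
}

/* Lets THREADS threads race for ownership; returns the number of winners. */
int run_claim_demo(void)
{
    pthread_t threads[THREADS];
    int ids[THREADS];

    atomic_store(&owner, 0);
    atomic_store(&winners, 0);
    for (int i = 0; i < THREADS; i++) {
        ids[i] = i + 1;
        pthread_create(&threads[i], NULL, claimer, &ids[i]);
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&winners);
}
```

Correct code like this always ends with exactly one winner. Code that replaces the compare-exchange with a separate check and store would be vulnerable at precisely the points where sync-instr adds extra thread switches.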

Configuring Thread Fuzzing in LiveRecorder

Thread Fuzzing is a feature of LiveRecorder. It isn’t included in UDB because the feature is targeted at hard-to-reproduce issues, and so is more effective in automatic scenarios. Don’t waste engineering time; let the machine do the work for you!

It can be enabled with a command line option, which by default uses all thread fuzzing components:

$ live-record --thread-fuzzing <other args> <process to record>

If you want finer control, set the UNDO_tf environment variable inline instead:

$ UNDO_tf=starve,random,in-bb,sync-instr live-record <args> <process to record>

The UNDO_tf environment variable is a comma-separated list: you can enable individual components or a combination.

To enable Thread Fuzzing via a LiveRecorder API session, include undolr_thread_fuzzing.h and call undolr_thread_mode_set() with a bitmask of the desired components to enable.

LiveRecorder can be set to record on failure, allowing only the misbehaving instances to be saved. To get the most out of Thread Fuzzing in your test pipeline, tests which may expose concurrency bugs should be run many times until a failure happens and then the recording can be analyzed to discover the root cause.
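
The repeat-until-failure idea can be sketched independently of LiveRecorder. The helper below, `run_until_failure()`, is a hypothetical illustration rather than a LiveRecorder API: it runs a test function in a fresh child process on each iteration so that a crash in one run does not take down the loop. In a real pipeline the test binary would instead be launched under live-record, as shown earlier.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `test` in a child process up to `max_runs` times.
 * Returns the 1-based index of the first failing run, 0 if every run
 * passed, or -1 if a process could not be created or waited for. */
int run_until_failure(int (*test)(int run), int max_runs)
{
    assert(test != NULL && max_runs > 0);

    for (int run = 1; run <= max_runs; run++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(test(run) == 0 ? 0 : 1);  /* child: report the result */

        int status = 0;
        if (waitpid(pid, &status, 0) < 0)
            return -1;

        /* A crash (killed by a signal) or a nonzero exit both count as failure. */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            return run;
    }
    return 0;
}

/* Hypothetical stand-in for a flaky concurrency test: fails on run 3. */
int example_flaky_test(int run)
{
    return run == 3 ? 1 : 0;
}
```

Stopping at the first failure keeps the pipeline's attention on the one run that misbehaved, which is the run whose recording is worth analyzing.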
