Resources
Expose Concurrency Bugs With Thread Fuzzing
Wikipedia defines fuzz testing as:
an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program. As programs get larger and more complex they often need to process data or execute instructions concurrently, which can lead to non-deterministic defects that arise when execution of one thread interferes with another.
Undo comes with a capability that we call thread fuzzing which manipulates the way that threads are scheduled, making concurrency bugs that are rare in normal conditions become statistically more common – allowing them to be uncovered before they hit customers.
Thread fuzzing is used by engineering teams who want to go the extra-mile in ensuring the codebase is stable and issues are caught early before they go anywhere near deployment.
Thread fuzzing is typically enabled in existing unit tests or continuous integration tests to shake the codebase and exposure hidden race conditions. Because the failures are recorded, they become a lot easier (and faster!) to fix.
To the SAP HANA Cloud engineering team, this technology has become essential to continue ensuring stable releases:
We had written a unit test that stress-tested a newly added piece of concurrent code and there was some issue we couldn’t understand. We couldn’t reproduce the issue at first, but using Undo with thread fuzzing turned on, we were able to reproduce the issue within minutes.
Andreas Erz, Software Developer at SAP [Read full interview]
Thankfully, thread fuzzing is really easy to integrate into your automated test suite: you just drop your executable, no need to replace your synchronization logic and no need to recompile with all sorts of special flags. And unlike ThreadSanitizer, it works at scale without giving you false positives. All in all, it’s a lot less engineering effort.
Common Concurrency Bugs
Non-deterministic factors like thread switching and external data can affect the order or timing of thread execution, resulting in concurrency issues that cause unpredictable application behavior such as miscalculations, crashes or hangs.
Below are some examples of common concurrency defects.
Atomicity Violation
If the execution of Thread 1 below is interrupted by Thread 2 immediately after evaluating the branch condition, then Thread 1 could crash with a memory access violation.
Thread 1: if (foo->bar) { do_something(foo->bar); } Thread 2: foo->bar = NULL;
Deadlock
It is possible for Thread 1 to wait indefinitely for Thread 2 to unlock L2, while Thread 2 is waiting for Thread 1 to release L2. Pared back to its simplest case:
Thread 1: pthread_mutex_lock(L1); pthread_mutex_lock(L2); Thread 2: pthread_mutex_lock(L2); pthread_mutex_lock(L1);
Race Condition
A race condition is a type of software defect that occurs when separate threads interact in an unforeseen way and disrupt the expected timing and ordering of operations.
For example, where two threads try to change shared data at the same time, leading to unpredictable system behavior. That is, multiple threads are in “a race” and different threads might win the race depending on non-deterministic events.
Thread Fuzzing in Undo
Issues like these are difficult and time consuming to recreate and investigate. Thread fuzzing can help capture them before they are released into production.
In Undo, the thread fuzzing feature manipulates the way that threads are scheduled. A number of different thread fuzzing modes are available and you can configure one or more of these modes to be used when recording your application:
- Thread starvation (UNDO_tf=starve)
- Randomising thread slices (UNDO_tf=random)
- Switching inside basic blocks (UNDO_tf=in-bb)
- Switches around locking/syncing instructions (UNDO_tf=sync-instr)
Thread Starvation
A common type of concurrency bug is due to ordering problems, for instance when there’s a fast data-generating thread and a slower second thread consuming that data. The consumer thread, being slower, tends to always have data to consume, so noticing bugs is rare. However, if the consumer thread overtakes the generator thread, for instance due to slow I/O, an error might occur.
char* array[100] = {0}; void generator_thread() { for (int i = 0; i < 100; i++) { array[i] = strdup("Hello world\n"); } } void consumer_thread() { for (int i = 0; i < 100; i++) { // Error: the consumer can overtake the generator // and call puts() on NULL! puts(array[i]) } }
Thread fuzzing’s starve
mode encourages race conditions by randomly picking some threads and preventing them from making progress for a short period of time.
Randomizing Thread Slices
Normally, Undo lets a thread run for a fixed amount of basic blocks before letting the kernel switch to another thread. The random
mode makes the length of these runs random and often much shorter to increase the number of thread switches. This increases the chances of these threads interrupting each other.
Switching Inside Basic Blocks
A basic block is a code sequence with one entry point (nowhere inside is the destination of a jump instruction elsewhere) and one exit point (only the last instruction can cause the program to begin executing in a different block). By default, Undo doesn’t allow basic blocks to be interrupted. Thread switches can only happen at basic block boundaries where the code does branch.
This may hide bugs happening due to having an inconsistent status for a very short amount of time, for instance:
volatile int value1 = 0; volatile int value2 = 0; void setter_thread() { for (int i = 0; i < 100; i++) { value1 = i; value2 = i; } } void checker_thread() { for (int i = 0; i < 100; i++) { assert(value1 == value2); } }
The above code would never fail in Undo with default settings as the setter thread can never be interrupted between the two assignments. The in-bb
Thread fuzzing mode allows thread switches to happen anywhere, making it possible to reproduce this kind of bug.
Switches Around Locking/Syncing Instructions
Basic locking functionalities and atomic operations, for instance gcc’s __atomic_*
functions or pthread mutexes, are generally implemented using machine instructions that are mainly used in this context (for instance, Intel’s cmpxchg
instruction).
By allowing extra thread switches around these specific instructions we can make it more likely that another thread will be run at this point, potentially exposing concurrency bugs
Configuring Thread Fuzzing in Undo
Thread fuzzing is a feature of Undo (which includes the LiveRecorder tool). It is not included in UDB because the feature is targeted at hard-to-reproduce issues, and so is more effective in automatic scenarios. Why waste engineering time when you can let the machine do the work for you?
It can be enabled with a command-line option, which by default uses all thread fuzzing components:
$ live-record --thread-fuzzing <other args> <process to record>
If you want a more detailed way of controlling it, you can use the environment variable inline instead:
$ UNDO_tf=starve,random,in-bb,sync-instr live-record <args> <process to record>
The UNDO_tf
environment variable is a comma-separated list: you can enable individual components or a combination.
To enable thread fuzzing via a LiveRecorder API session, include undolr_thread_fuzzing.h
and call undolr_thread_mode_set()
with a bitmask of the desired components to enable.
The LiveRecorder tool can be set to record on failure, allowing only the misbehaving instances to be saved. To get the most out of thread fuzzing in your test pipeline, tests which may expose concurrency bugs should be run many times until a failure happens and then the recording can be analyzed to discover the root cause.
NEW: Feedback-directed thread fuzzing (public beta)
In Undo 8.0, we introduced an advanced debugging functionality called Feedback-directed Thread Fuzzing. This is a variant on thread fuzzing that uses a two-phase approach:
- 1st phase to observe which memory locations are modified by multiple threads
- 2nd phase to force thread scheduling/starvation at exactly those points
This offers a powerful way to find intermittent races.