What the actual heck just happened?

I wrote in an earlier post about using LiveRecorder to get more information about a failure from automated tests. That was about automated testing in general; in this article I’ll focus on a sub-category of automated testing, in which randomness is deliberately thrown into the mix.

The advent of automated testing moved the human bottleneck from running tests to designing them. Machines could run the tests we set them over and over again, with high fidelity, but they were limited to checking cases that we humans had concocted. To be sure, there are ingenious test writers, and a bit of white-box testing allows them to hit cases that a black-box tester never would. Yet the full map of code paths remained largely unexplored.

Enter the automated random element. In essence, we take the machine’s ability to rapidly run test cases, and allow it to make its own script.

In order to be productive, the randomization needs to be somewhat guided. Most combinations in the entire phase space of a given function’s inputs may be invalid, repeatedly exercising the same trivial error-handling path (after the first few such cases highlight the initial lack of error handling, maybe). The randomization must be loosely constrained to cluster around the hypervolume of sensible inputs. I say “loosely” because in addition to deviating slightly from validity, to test bounds checking closely, you also want some tests to ping off into deep phase space in case there are some bizarre realms that are mistakenly interpreted as valid when they’re not.
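This kind of loosely constrained generator is straightforward to sketch. Here's a minimal illustration in Python, assuming a hypothetical function whose valid inputs are lengths in the range 0..4096 (the function, the bounds, and the probability split are all invented for the example):

```python
import random

def random_input(rng: random.Random) -> int:
    """Generate one input for a hypothetical function accepting
    lengths 0..4096: mostly cluster around the valid region,
    sometimes deviate just past the bounds to exercise bounds
    checking, and occasionally ping off into deep phase space."""
    roll = rng.random()
    if roll < 0.80:
        # Valid region: exercise the interesting code paths.
        return rng.randint(0, 4096)
    elif roll < 0.95:
        # Just outside the bounds, to probe bounds checking closely.
        return rng.choice([rng.randint(-3, -1), rng.randint(4097, 4099)])
    else:
        # Deep phase space: wildly out-of-range values, in case some
        # bizarre realm is mistakenly treated as valid.
        return rng.randint(-2**63, 2**63 - 1)

rng = random.Random(1234)  # fixed seed: the whole run is reproducible
samples = [random_input(rng) for _ in range(1000)]
```

Note the fixed seed: that single number is what lets a failing run be replayed later, which becomes important below.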

There’s some interesting work in the field of getting the machine to deduce sensible constraints for itself, e.g. Godefroid, Klarlund and Sen, who also explore getting the randomizer to guide its own white-box exploration. The very existence of their paper is a sign of the complexity of thoroughly unit testing even a modest component. The situation is difficult enough if your component is stateless; once you add in statefulness, your explorable phase space grows exponentially.
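To see why statefulness blows up the phase space, consider a toy sketch (this `Connection` class is purely illustrative, not from any real codebase): the very same call is valid or invalid depending on the history of prior calls, so a randomized tester must now explore sequences of operations, not individual inputs.

```python
class Connection:
    """Toy stateful component: whether an input is valid depends on
    what happened before, not just on the input itself."""

    def __init__(self):
        self.open = False

    def connect(self):
        if self.open:
            raise RuntimeError("already connected")
        self.open = True

    def send(self, data: bytes) -> int:
        # Valid only after connect() and before close().
        if not self.open:
            raise RuntimeError("not connected")
        return len(data)

    def close(self):
        if not self.open:
            raise RuntimeError("not connected")
        self.open = False

# The same send() call succeeds or fails depending on history:
conn = Connection()
conn.connect()
sent = conn.send(b"hello")   # valid here
conn.close()
```

With even this trivial component, the explorable space is every interleaving of `connect`/`send`/`close` of every length, which is why sequence length matters as much as input values.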

The final compounding factor is when we move up from unit testing into integration and system level testing. Now there are multiple components working together, which compounds statefulness. To explore the vast atlas of code paths now available, we not only need the automated exploration that randomness provides, but we require the test system to go on a possibly lengthy journey to explore that part of the space opened up by statefulness. This is why we have soak tests, with randomized workload generators and so on.

That such systems are effective in identifying software bugs is beyond doubt. From one perspective, they are almost too good at it and can generate a growing backlog of issues. Clearly, knowing about the existence of a bug is better than not; it’s the first step towards fixing the bug. However, having a growing backlog of known issues can be demoralizing. In a bid to get a handle on the situation it is common to seek to categorize such failures, grouping similar failure modes together. Subconsciously I suspect this is also something of a comfort blanket. We can pretend there aren’t as many bugs as it appears if that lot over there might have the same root cause.

But why does the backlog grow in the first place? You should just get your devs to fix bugs as they’re identified, right? The reason that doesn’t always happen lies in the nature of the test. It’s very much like an issue encountered by a customer, except that in this case the “customer” isn’t upset that they’ve hit a bug (unless they keep hitting the same one, in which case the utility of the test system is nullified), and the “customer” usually has a much better recollection of what they did leading up to the failure, often in the form of a random seed.
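That "recollection in the form of a random seed" is cheap to provide. A minimal sketch of the idea (the operation names and the `SOAK_SEED` environment variable are invented for illustration):

```python
import os
import random
import time

def soak_ops(seed: int, n: int = 100) -> list:
    """Generate the operation sequence for one soak-test iteration.
    The seed fully determines every random choice, so the sequence
    that led to a failure can be regenerated exactly."""
    rng = random.Random(seed)
    return [rng.choice(["insert", "lookup", "delete"]) for _ in range(n)]

# Log the seed up front: it is the "customer's recollection" of the run.
# A failing run can be replayed later with SOAK_SEED=<logged value>.
seed = int(os.environ.get("SOAK_SEED", time.time_ns()))
print(f"soak seed: {seed}")
ops = soak_ops(seed)
# ... drive the product under test with `ops` ...
```

Logging the seed makes the run reproducible in principle; as the next section discusses, reproducible is not the same as easy to diagnose.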

I’ve previously discussed the difficulty of constructing a reproducer for a failure found in manual testing. The main difficulties in constructing a reproducer for a failure found by automatic randomized soak testing are very different. Here we know, or can in principle reconstruct, exactly what the test harness did to drive the product. The trouble is that what it did may be “a lot”. How much of that is relevant? How much of the statefulness generated by what went before is relevant to the latest inputs? In other words, where in the phase space of all possible inputs and states are we?

Good luck finding the bug!

These are exactly the difficulties encountered by the SAP HANA team when they implemented their fuzz tester.

By integrating LiveRecorder into the test framework, you can catch your product red-handed. A developer no longer needs to pick through the detailed logs of what happened to try and identify the salient points. It’s all in the recording, and you can unravel the provenance of incorrect data causing a failure, going backwards in time until the root cause is unveiled. Typically it’s not all that far back in time, which means you don’t have to go on the whole journey from clean state to discover whatever state caused one innocuous input to blow up. And, critically, you don’t have to speculate and iterate until the same arduously repeated failure finally gives up the vital clue.

Having a recording of program state over time makes bugs that were to all intents and purposes unsolvable, solvable. There is no longer an ever-growing backlog of intractable tickets, and the test infrastructure you invested all that time in is free to move on to find ever more weird and wonderful corner cases in uncharted phase space territory.