Why you’re probably asking your test system the wrong question

Why you’re probably asking your test system the wrong question

In this series of blog posts, we will take a look at various facets of Undo’s LiveRecorder, starting with a number of scenarios in which LiveRecorder can prove useful. Later, we’ll look at ways of integrating LiveRecorder into a workflow, before moving to a deeper level to explore how to get the best possible experience out of LiveRecorder.  Initially we’ll be somewhat focused on LiveRecorder for Automated Test, but the later posts will apply equally to LiveRecorder for Production.

In a few posts time we’ll look at a couple of specific types of automated testing, and the particular concerns raised therein, but for this article we’re considering automated testing in general.

traffic_light_code_smallAutomated testing has been around for quite some time in one form or another. Engineers have always sought ways to reduce the amount of tedious repetition in their lives, whilst computers are really good at repetition. It was only natural, therefore, to try and harness the tools available by creating tests which could be run programmatically, determining whether the result produced was as expected, or indicative of a software bug.

“venerable projects may be lumbered with a legacy test system”

In the early days, individual engineers or small teams would come up with ad hoc scripts, which would coalesce into something resembling a test system. A few companies saw an opportunity early on to offer such systems as products, and subsequently various open source offerings have come onto the scene. Green field or smaller projects have the luxury of utilizing the latest developments in test systems, or at least benefiting from lessons learnt if building a home-grown system from the ground up. More venerable projects may be lumbered with a legacy test system, with many tests that would be too costly to migrate to another system - tests often get written in a way that ties them in subtle and intricate ways to a particular infrastructure.

Regardless of the system, the main emphasis is always on the binary result: does the test pass or fail? This is fairly natural, and I think arrives as a result partly of the minimum value proposition (if your test doesn’t tell you if it’s worked, it’s useless) but also largely from human nature.

What do I mean by the human nature element? Let us consider why we test. Automated testing is all about catching regressions. (If we didn’t care about regressions, we wouldn’t bother repeating the test; if we’re not going to repeat the test, why go to all the effort of automating it?) This can be, and I think often is, loosely translated into “I want to know that all my tests pass.” This is, after all, the diamond criterion for allowing software to ship. We’ll take as read that the team has made every effort to ensure that the test plan covers all functionality, and therefore “all my tests pass” implies “I have identified no bugs”.

However, phrasing the goal in this way encourages a bias towards the “success” side of the binary result coin. If a test passes, my job is done. What if a test does not pass? Why, then, it is an inconvenient truth that does not fit with the objective. Raise a defect and move on.

“What can my test failure tell me?”

Supposing we rephrased the goal thus: “I want to resolve all issues causing my tests to fail.”  The objective is the same, but we’ve shifted the emphasis off the end result, and onto the means to get there. It’s an important difference, because now we’re thinking about the inevitable failures before they happen. We’re open to the question, “What can my test failure tell me, to help me fix the bug?”

A clued up QA team with a constructive relationship with Development will include information about the environment the test was run in, plus whatever diagnostics are produced by the testee application by default. Maybe some tests will also be run on debug builds, giving extra information such as asserts or diagnostic logging. And some of this information may even be collected automatically by the test infrastructure. All of these are great, but at the end of the day they are just clues to help a development engineer reproduce the issue in order to diagnose the root cause.

Oftentimes, in spite of being armed with all the above clues, a developer simply cannot reproduce the reported failure. It doesn’t matter whether it’s because the bug is sensitive to variations in timing, or because the test environment is subtly different to anything the developer can recreate. At the end of the day, the developer is stymied by inability to reproduce. Even when the failure can be reproduced, diagnosis of a given bug typically requires reproducing several times before the actual root cause is discovered. In a fine example of Murphy’s Law, the number of times needed to reproduce seems to correlate pretty well with the difficulty or time involved in each reproduction.

In a bid to provide as many, and good quality, clues as possible as to what went wrong - and thereby reduce the number of iterations needed to ascertain the root cause - an application will typically include some degree of diagnostic logging. Developers do their best to identify ahead of time what they think will be the most useful information when things go wrong, and to allow that information to be surfaced in an efficient and useful manner. Too much data, and the diagnostic output bloats to the point of being entirely intractable. Too little, and the developer diagnosing an issue is left wishing for more.

Undo’s program recording tool, LiveRecorder, takes that approach and puts it on steroids. It aims to remove the need to reproduce a failure from the path to diagnosis and resolution. Rather than rely on the developer to guess which information will be useful, it captures all nondeterministic events affecting a program’s execution. This allows the recording analysis tool to reconstruct the program’s state at any moment in its history, making all information available after the fact. The developer can explore how the program behaved backwards from the failure, as well as forwards from a known point; in fact to jump around in the program’s history at will.

“diagnose the root cause without reproducing the failure...Not even once.”

Provided the test infrastructure can record the application failing, the development team can use the recording to diagnose the root cause without reproducing the failure for themselves. Not even once.

Let me play you some incidental music while that sinks in…light_bulb_brainWe’ve gone from a test telling us “yes, that works” or “no, that fails” to one which hands all the information needed to do the forensic analysis into what the application under test actually did leading up to the observed incorrect behaviour. It’s rather like the leap from looking at slides under a microscope, to MRI scanning.

I’ve already highlighted the benefits of no longer having to reproduce the failure outside of the test environment. The ability to go backwards and forwards through the recording is also not to be sniffed at. A University of Cambridge study found that, on average, reversible debugging yields a 26% efficiency saving on developers’ debugging time.

That’s pretty good going, coming from a shift in perspective in what we’re asking of our automated test suite.

1At the time of writing, there are certain types of bug that LiveRecorder is incapable of capturing. The folks at Undo continue beavering away to reduce those that elude the technology; none is deemed inherently unrecordable.