Managers of software development teams need to get on top of test failures

Recording your program’s execution should be the standard

If there is one thing that software engineers will agree on, it is this: there is no such thing as bug-free software. This is true from the simplest programs, such as a basic calculator application, to the most complex, such as a large-scale multi-threaded database that powers cloud services across the globe. Because complex programs contain hundreds of thousands, if not millions, of lines of logic whose execution is closely entwined, any piece of shipped software will contain bugs, many of which appeared in QA and were allowed to go into production.

Nor is this new. For years, managers of software development teams have traded off the pressure to ship features against code quality: should they spend extra time trying to fix that really annoying bug that appears only once in every 300 runs, or should they stick to the software delivery schedule? Most often, the failure is tossed onto a pile of undiagnosed test failures that becomes a backlog, and at some point those failures rear their heads. Even the most carefully written code is not immune: The Economist, for example, writes that some of the cleanest software ever written – by NASA's Software Assurance Technology Centre – contained 0.1 errors per 1,000 lines of source code.

Yet what if that bug that you, as a QA manager or engineer, decided not to fix is a disaster waiting to happen? What if it is a glitch along the lines of the one that forced BA to ground all of its flights in September 2016 and again in May 2017? How can you possibly tell whether one of the failing tests that it was your responsibility to fix, and that you decided was a lower priority than shipping on time, will result in the loss of a $10 million customer? The truth is you can’t. And the potential impact of software defects on any vertical that sells software is growing. The cost runs deeper than mopping up after a disaster, which is painful and stressful for all concerned. It means plummeting stock prices, loss of market share and evaporating customer trust and loyalty. Tricentis, a continuous testing platform vendor, launched its Software Fail Watch report in January of this year to highlight the problem. Analysing 606 failures, it found that over 3.6 billion people had been affected by these software problems, which resulted in $1.7 trillion in lost revenue to software vendors.

These hidden disasters are lurking in the products of companies that make their living selling software, much of it built on decades-old code written in C and C++. Here, database vendors are in the spotlight. Numerous acquisitions and code merges over the years have left them with legacy code that tends to misbehave. The same inputs can result in different executions on different occasions, which makes such bugs a nightmare for database vendors to find and fix before they result in one (or more) unhappy customers.

Pinpointing bugs is itself a challenge, and a seemingly impossible one when they are virtually irreproducible. During the testing phase, these bugs may affect the program only subtly (if they appear at all) and have barely any effect on its outputs. But when they manifest in production, the consequences for businesses can be severe, as when Salesforce’s CEO had to apologise directly to US users after a file integrity issue made the database inaccessible for days, or when Amazon’s database couldn’t handle a slight disruption, which caused outages throughout the Amazon network.

This problem is so serious that one wonders why more stringent regulations aren’t placed on QA. Yet, at the same time, vendors must deliver software on time. If competitors can prove that they can get there before you, you lose all advantage.

The solution is for QA and software development managers to couple rigorous testing with rigorous debugging. The revolution in testing has already happened: thousands of automated tests can now be run simultaneously to exercise code from many angles. The debugging revolution, by contrast, is only beginning. If you are a manager who doesn’t want to be in the firing line for the potential loss of a major customer, consider your debugging strategy and what you can do to be more confident that you’re not releasing a disaster.

The ability to capture and replay program execution is one solution to the problem of irreproducible test failures and the only viable means of ensuring that you’re not releasing a disaster. The premise is simple. You take an exact recording of a program’s execution, so that you capture an exact replica of a failing run. A recording represents a 100% reliable, reproducible test case that offers total visibility into all the factors that led up to (and caused) the crash. This means you no longer need to fear that a sporadically failing, irreproducible test might mean the loss of a $10 million account, because you know that the failure can be captured and fixed before it makes it into production. This is where replaying the recording comes in. Rather than stepping line by line through code to try to identify the exact piece that failed, developers can interrogate the recorded run directly, a method that maximises efficiency and allows them to debug quickly.
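To make the premise concrete, here is a deliberately tiny sketch in Python. It is an illustration under stated assumptions, not a description of how any real recording tool works: production tools capture execution at a much lower level, whereas this toy only logs the nondeterministic values a program consumes. Every name in it (Recorder, Replayer, flaky_job, capture_failure) is hypothetical and invented for this example.

```python
import random
from typing import Callable, List, Optional


class Recorder:
    """Wraps a nondeterministic source and logs every value it hands out."""

    def __init__(self, source: Callable[[], int]) -> None:
        self._source = source
        self.log: List[int] = []

    def next_value(self) -> int:
        value = self._source()
        self.log.append(value)  # keep an exact record of what the run saw
        return value


class Replayer:
    """Feeds a previously recorded log back to the program, in order."""

    def __init__(self, log: List[int]) -> None:
        self._values = iter(log)

    def next_value(self) -> int:
        return next(self._values)


def flaky_job(env) -> int:
    """Hypothetical program under test: it fails only on a rare input."""
    total = 0
    for _ in range(5):
        value = env.next_value()
        if value == 0:  # the rare condition that triggers the bug
            raise RuntimeError("rare failure")
        total += value
    return total


def capture_failure(max_runs: int = 10_000) -> Optional[List[int]]:
    """Run the job repeatedly and return the recording of the first failing run."""
    for _ in range(max_runs):
        recorder = Recorder(lambda: random.randrange(300))
        try:
            flaky_job(recorder)
        except RuntimeError:
            return recorder.log
    return None  # no failure observed within max_runs
```

Once capture_failure() has returned a log, calling flaky_job(Replayer(log)) raises the same error on every attempt, so the sporadic failure has become a repeatable test case that can be examined as often as needed.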

Static and dynamic analysis tools can detect certain classes of problems - for instance, they can help developers find implementation bugs - but they can’t detect all of them. For example, they are of no use for more serious bugs in runtime code, for which only traditional debugging methods, such as core dumps and log files, remain. Recording and replaying program execution is the obvious sequel to the testing revolution. It should become the new standard for debugging if managers truly want to prevent the next disaster.

Better yet, if recording tools were more widely used, new and emerging technologies and industries would be able to share learnings and best practice, ensuring the industry tackles costly and potentially dangerous failures together.