Evolution is not an evenly paced process. Rather than happening continuously at the same speed, the earth’s ecosystem tends to be relatively stable over the eons, periodically disrupted by rapid change over a few million years. Sometimes this is triggered by outside events, such as the asteroid impact that killed off the dinosaurs, while other shifts are less well understood.
It seems to me that cultural and technological changes tend to happen in brief bursts on human timescales too. Mainframes and minicomputers in the 1960s and 70s suddenly gave way to desktop computers in the 1980s, which were in turn suddenly displaced by mobile platforms. (I suspect we’re now coming to the end of the shift to mobile, i.e. the battle is over and mobile has won, but who knows what will happen in the next few years?) The culture of software development exhibits the same pattern. For example, during the 1970s there was a relatively sudden shift to higher-level programming languages, and in the late 1980s IDEs and interactive debuggers became widely used. (Pioneers were using these things much earlier, but I’m talking about mass adoption here.) Between, say, 1990 and 2010 the way we created and deployed software didn’t evolve radically, despite the external transformation of IT that the growth of the internet brought. Of course languages and styles came into, and went out of, fashion, but a developer from 1990 could easily recognise the landscape twenty years later: the tools and processes he or she had been using would still be there, pretty much the same, just newer and shinier. For example, Visual C++ 1.0 was released in 1993. Seventeen years later, the now-rebranded Visual Studio 2010 was a lot slicker and richer, with a much more beautiful UI, but it was basically the same thing.
Move forward into the current decade and the way we develop and deploy software is now changing incredibly fast. In 2010 lots of teams had already adopted practices such as Agile, Test Driven Development (TDD), and even Continuous Integration (CI), but the vast majority of development teams were still doing things the old-fashioned way, with big Product Requirement Documents and year-long or multi-year development cycles that ended in a multi-month code freeze and QA period. A shockingly large number of software projects didn’t even have any kind of comprehensive regular regression testing. Five short years later, almost everyone uses some kind of Agile with a decent test-suite, and those releasing on a yearly or quarterly cadence are the laggards, firmly in the minority; most software projects release or deploy new versions multiple times per month and some even several times per day. By moving from the old waterfall style to CI and Continuous Deployment (CD), and by embracing DevOps, teams introduce new features faster, which delivers competitive advantage and lets them respond more quickly to user needs. Gartner predicts that 25% of Global 2000 organisations will be using DevOps by 2016. A software engineer from 1990 would feel right at home if transported to a typical development team in 2005. An engineer from 2005 transported to 2015 would recognise very little.
The wider digital economy has of course been undergoing huge changes at the same time. Software is ever more vital to the world we live in, whether in cars, planes, the Internet of Things, wearables, or our smartphones. The shift to cloud and mobile as the dominant computing platform is at least as big a transformation as the move from mainframes to desktops. And all the time, software, as they say, is eating the world.
These modern practices (Agile, Test Driven Development, Continuous Integration, Continuous Deployment, DevOps) have combined with new languages and platforms to allow software to be produced at an incredible rate. But more software means more complexity and the result is daunting. As software controls more and more of the world around us, programs no longer operate in isolation, but interact with other software and run on multiple devices in an increasingly intricate ecosystem. Modern practices didn’t make life simpler for long; rather, they allowed us to do more, and so complexity increased. Research from the Judge Business School at the University of Cambridge found that the annual cost of debugging software had risen to $312 billion. The study found that, on average, software developers spend 50% of their programming time finding and fixing bugs. It seems that solutions to problems lead to new problems. It’s like hanging wallpaper: you remove a trapped bubble of air from one area, only to see a bubble pop up elsewhere.
The catch: TDD and CI mean more testing and that means more test failures
Test Driven Development means (or at least should mean!) more test-cases. Continuous Integration means these richer, fuller test-suites get run more often. The cloud and elastic compute resources mean that there is practically no limit to the number of tests that can be run. A software project of a given size can easily run two or three orders of magnitude more tests every day than the equivalent project would have run ten years ago. Which can only be a good thing, right? Except, of course, for all those test failures. It’s no good arguing that all tests should pass; after all, if tests never fail, what’s the point in having the tests in the first place? If many thousands of tests run every hour, and 0.1% of them fail, triaging these failures can quickly become a nightmare. (And a failure rate as low as 0.1% is rare; I have spoken to companies where more than 10% of their overnight tests fail.)
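To put rough numbers on this, here is a minimal sketch of the arithmetic in Python. The volume of 5,000 test runs per hour is purely an illustrative assumption standing in for "many thousands", and the function name is mine. The 0.1% rate is the one used above; applying the 10% figure (which was quoted for overnight runs) to the same volume is only there for comparison.

```python
def failures_per_day(tests_per_hour: int, failure_rate: float, hours: int = 24) -> float:
    """Expected number of failing test runs per day at a steady failure rate."""
    return tests_per_hour * hours * failure_rate


# Hypothetical volume: 5,000 test runs per hour, around the clock.
TESTS_PER_HOUR = 5_000

for rate in (0.001, 0.10):
    daily = failures_per_day(TESTS_PER_HOUR, rate)
    print(f"{rate:.1%} failure rate: about {daily:,.0f} failures to triage per day")
```

Under those assumptions even the optimistic 0.1% rate produces on the order of a hundred failures a day, and the 10% rate produces thousands.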
In my experience most teams have so many test failures that they don’t have time to investigate them all. Typically, the majority of the failures (let’s say nine out of ten) turn out to be benign: a bug in the test itself rather than in the code it’s testing, a problem with the test infrastructure, or some esoteric combination of circumstances that you know will never happen in practice. But lurking somewhere in those ten failures is the one that does matter: the kind of bug that will eventually cause a serious production outage or a security breach. And the only way to know which is the one in ten that matters? Investigate until you properly understand (which usually means fix) all ten failures. Very few of us have the time to do that, and so we essentially play Russian roulette with our code.
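The Russian roulette comparison can be made concrete with a small Python sketch. It rests on simplifying assumptions of my own: exactly one of the failures in a batch is the serious one, it is equally likely to be any of them, and batches are independent of one another. The function names are hypothetical.

```python
def miss_probability(failures: int, investigated: int) -> float:
    """Chance that the single serious failure is among those left uninvestigated."""
    if failures <= 0 or not 0 <= investigated <= failures:
        raise ValueError("need failures > 0 and 0 <= investigated <= failures")
    return 1 - investigated / failures


def miss_at_least_once(failures: int, investigated: int, batches: int) -> float:
    """Chance of missing the serious failure in at least one of several independent batches."""
    catch_every_time = (1 - miss_probability(failures, investigated)) ** batches
    return 1 - catch_every_time


# Ten failures per batch, one of which matters; vary how many get investigated.
for investigated in (3, 5, 8, 10):
    single = miss_probability(10, investigated)
    repeated = miss_at_least_once(10, investigated, batches=20)
    print(f"investigate {investigated}/10: miss chance {single:.0%} per batch, "
          f"{repeated:.0%} over 20 batches")
```

The point of the second function is that the risk compounds: a team that usually gets away with skipping a few failures will, over enough batches, almost certainly skip the one that matters.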
Set against the hundreds or thousands of test failures that a mid-size software company might experience every day, the scale of the problem becomes clear. All these test failures are in fact a sub-problem of the wider issue that all this software we’re creating is so fantastically complicated that no one really understands what it’s actually doing. Now to be clear, I think it’s beyond question that Agile, TDD, CI, CD and DevOps are all good things. Software engineering as a profession has made great advances over the past decade, of which we programmers should collectively be proud. But these advances have ushered in new problems (or at least exacerbated old ones) to which we must now turn our collective attention.
Software development teams need to take a step back and find ways to understand what their software actually did, rather than what they thought it would do, and then to share that information seamlessly across the team so that it can be evaluated and the failures fixed. The good news is that technology can help: a number of new technologies and tools are becoming available (including from Undo, where I work) that can help developers to understand what their code is really doing, both under test and in production. I firmly believe that bridging this understanding gap is one of the major challenges the industry needs to solve during the remainder of this decade. Otherwise, the rising tide of test failures and the general complexity of software will slow or even halt the increase in the pace of development we have witnessed in recent years. No single technology or technique is going to make the problem go away, but I believe the next generation of tools can and will help.