Debugging is going to get more complex

All software systems are becoming more complex and interconnected. Simultaneously, end users expect software to become increasingly coherent and simple to use - a situation IBM noted in 2013, and one that is even more true today.

This is hardly fair.

On the one hand we, the builders of software and architects of this brave new world, are having to wrestle (expertly and without complaint, I might add) legacy systems, new cloud services, embedded and IoT devices, and business systems so that they play nicely together.

And on the other hand, the user wants to tap one button and have, within one second, the answer to “where, nearby, is there a doctor who can help with my specific medical condition which has just been reported by my IoT health monitor? I need this between 2pm and 4pm today and want someone who is private but doesn’t charge too much”. Oh, and “I’ve tried these doctors already and don’t like them”. Oh, and “They must be reviewed well on two or three separate review sites”.

"The complexity of software is escalating"

The desire for ease of use, I might add, is steadily increasing at exactly the same time as the complexity of the software is escalating. For some reason, "have your cake and eat it" appears to be the industry standard for software development.

So we work at getting the systems playing nicely together, orchestrating huge quantities of computing power and mashing vast amounts of data. We write software to do all this for us and we deploy it to devices and servers and serverless environments.

And it works like magic. And then...

The magic wears off almost immediately.

“Can we preempt the user by linking directly to their health monitor?”

“Yes... yes we can … let me just …”

The fact that we provide these answers so rapidly is soon taken for granted, so we need to bring in more systems and more data to create new value for the users (as well as features for our products). Each time we coordinate these systems, the expectations of users and of the business grow. More and more, they just expect software to work, and to work now!


Challenges of debugging

Yet there is a hidden and significant cost to maintaining multiple highly integrated systems. With each new system, the number of potential relationships increases, and so do the complexity and difficulty of debugging. During R&D or development, this slows the work down. In production it does more than that: the system can become so complex that you can't get the information you need out of it (if you even know what information you need in the first place).
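To put a rough number on that growth, here is a minimal sketch. It assumes that a "relationship" means a direct link between two systems, which is my simplification rather than a formal definition, and the function name potential_links is made up for illustration. Under that assumption, n systems have n(n-1)/2 possible pairs.

```python
def potential_links(n_systems: int) -> int:
    """Count the distinct pairs among n_systems nodes.

    Assumption: a relationship is an undirected, direct link between
    two systems, so the count is n * (n - 1) / 2.
    """
    if n_systems < 0:
        raise ValueError("n_systems must be non-negative")
    return n_systems * (n_systems - 1) // 2


# Show how quickly the number of pairs grows as systems are added.
for n in (2, 5, 10, 20):
    print(f"{n} systems: {potential_links(n)} potential links")
```

Going from 5 systems to 20 multiplies the number of systems by four but takes the potential links from 10 to 190. Real integrations rarely connect everything to everything, so treat this as an upper bound on pairwise links rather than a measurement.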

When something bad or unexpected happens in a network of interconnected systems, the people working on it have access to only a subset of the information, because they almost never control all the nodes. Many proprietary systems, IaaS and business platforms record some, but not all, of the activity. And most legacy systems log little or nothing.

In this highly networked and integrated world, how do you ensure software quality? How can you solve problems rapidly and minimise the impact of intermittent failures in production? How do you know what the hell is going on?


Record, rewind and replay

We cannot manage what we can’t measure, and we cannot measure what we cannot record.

You need to start scientifically. Record everything you can. Start by ensuring you can reproduce problems easily or, more elegantly, that you can record what happened in the first place.

Of course, you cannot record all events on all nodes in your network, but the nodes under your control should provide logging down to the most detailed level possible. Recording anything and everything available on as many devices as possible in production radically reduces the difficulty of tracking down bugs in a highly integrated system. From the information you’ve gathered, the team charged with fixing the problem can extrapolate and reason about what really happened. In Undo's case, developers can analyse the problem using record, rewind and replay via its Live Recorder and UndoDB technology.
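As a rough illustration of "record first, reason later" on the nodes you do control, here is a minimal sketch using only Python's standard logging and json modules. It is plain structured logging, not Undo's Live Recorder or UndoDB, and every name in it (JsonLineFormatter, make_recorder, replay, the request_id field and the two node names) is hypothetical.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": record.created,        # time the event was logged
            "level": record.levelname,
            "node": record.name,         # which system emitted the event
            "message": record.getMessage(),
            # Hypothetical field: an ID shared by all events of one request.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def make_recorder(node: str, stream) -> logging.Logger:
    """Create a logger for one node that records everything to stream."""
    logger = logging.getLogger(node)
    logger.setLevel(logging.DEBUG)       # most detailed level
    logger.propagate = False             # keep output out of the root logger
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger.addHandler(handler)
    return logger


def replay(recorded: str, request_id: str) -> list[dict]:
    """Return the recorded events for one request, oldest first."""
    events = [json.loads(line) for line in recorded.splitlines() if line]
    matching = [e for e in events if e["request_id"] == request_id]
    return sorted(matching, key=lambda e: e["ts"])


# Two nodes under our control write to one shared record (here, in memory).
buffer = io.StringIO()
monitor = make_recorder("health-monitor-gateway", buffer)
booking = make_recorder("booking-service", buffer)

monitor.debug("reading received", extra={"request_id": "req-1"})
booking.debug("searching for doctors", extra={"request_id": "req-1"})
booking.debug("unrelated lookup", extra={"request_id": "req-2"})

timeline = replay(buffer.getvalue(), "req-1")
for event in timeline:
    print(event["node"], event["message"])
```

The design choice worth noting is the shared request ID: without something that ties events from different nodes together, a pile of detailed logs is still hard to reason about. In a real deployment the records would go to durable storage rather than an in-memory buffer, and clocks on different machines may disagree, so ordering by timestamp alone is an assumption that only holds loosely.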

Just as users demand the moon on a stick, technical teams need to demand all the information available. Without it, the system becomes too complex and the users won’t get their magic.
