Debugging Part Three: Debug like a professional – Differential diagnosis

How do you decide what is the cause of a bug if there are several possible ideas and the symptoms don’t point unequivocally to one thing?

You use differential diagnosis (debugging the right way), which allows you to see which of the possible diagnoses you should concentrate on.

I’m never sure why this mental tool isn’t taught more widely, even down to primary school, because it’s so powerful. It involves gathering what you know (the symptoms), identifying the causes which would explain each of the symptoms, and deciding either which cause is most likely or what new information you need to gather to make your decision.

This method is particularly powerful for long-running, hard-to-resolve and hard-to-replicate bugs, where it’s important to think carefully about what you’re going to investigate.

It is also a mental tool that many professionals use when debugging without realising it. They race from what they know to what their next step should be in just a few moments. Even so, it’s worth learning as an explicit technique, and it’s often worth using for political reasons when a complex bug isn’t being addressed properly.

If you ever watched the TV series House, starring Hugh Laurie, this is exactly what the doctors do when they “do the differential”. (And it’s never Lupus…)

Here’s how you do a differential in software

1. List your symptoms

These may be the original bug report, information from the logs, how the system used to behave, what is known about latency on the network or anything else you think relevant.

What you know often depends on what systems you have. Whether you have good logs on production systems, a method for gathering crash reports, or powerful debugging in your IDE, as JetBrains’ CLion has with Undo [https://blog.jetbrains.com/clion/2016/09/undo-for-clion/], will determine how many symptoms you can capture on your list. The more, the better.

2. List the causes which explain each symptom

Now put all the possible causes next to each of the symptoms. For example, slow response over the network might be network speed, a security handshake or a slow database on the backend. Or something else.
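
If you like to keep the list in code rather than on a whiteboard, here is a minimal sketch in Python. The symptom "timeouts in the logs" and the helper name causes_to_symptoms are made up for illustration; only the slow network example comes from the text above.

```python
from collections import defaultdict

# Hypothetical symptoms, each mapped to the causes that could explain it.
candidate_causes = {
    "slow response over the network": [
        "network speed",
        "security handshake",
        "slow backend database",
    ],
    "timeouts in the logs": ["slow backend database", "network speed"],
}


def causes_to_symptoms(symptom_map):
    """Invert symptom -> causes into cause -> symptoms it would explain."""
    inverted = defaultdict(list)
    for symptom, causes in symptom_map.items():
        for cause in causes:
            inverted[cause].append(symptom)
    return dict(inverted)


explained = causes_to_symptoms(candidate_causes)
for cause, symptoms in explained.items():
    print(f"{cause}: explains {len(symptoms)} symptom(s)")
```

Turning the table around so that each cause lists the symptoms it accounts for makes the later steps easier to reason about.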

3. Cross out all the causes which are impossible or highly unlikely

Some ideas put forward in step 2 are just not useful. Sometimes people suggest things for political reasons or to get the problem off their own plate, such as “the supplier’s database is always slow” or “power outage”. These may turn out to be relevant, but it’s worth crossing off the more far-fetched and less useful suggestions for now and revisiting them later if all else fails.

Think Sherlock Holmes: once you have eliminated the impossible, whatever remains, however improbable, must be the truth.
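
A small sketch of the crossing-out step, assuming you want to keep the rejected ideas around rather than lose them. The function name cross_out and the example cause list are hypothetical.

```python
def cross_out(causes, ruled_out):
    """Split causes into those still in play and those parked for later."""
    remaining = [c for c in causes if c not in ruled_out]
    parked = [c for c in causes if c in ruled_out]
    return remaining, parked


# Hypothetical list of causes gathered in step 2.
causes = [
    "network speed",
    "security handshake",
    "slow backend database",
    "power outage",
]

remaining, parked = cross_out(causes, ruled_out={"power outage"})
print("Still in play:", remaining)
print("Revisit if all else fails:", parked)
```

Keeping the parked list means that nothing is thrown away for good, which matches the advice to revisit far-fetched ideas later.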

4. Pick the cause with the most symptoms or the most prominent symptoms

Finally, you need to decide which cause you think is most likely based on the symptoms in front of you. Sometimes this is obvious, when all symptoms point to one cause more than any other cause. Other times it requires some discussion and thinking through.

If one cause isn’t more obvious than any of the others, you should decide what new information will help you make that decision. Do you need more performance data? Or do you need to know how the algorithm works in a particular edge case? Use the debugging tools available to you to gather that information and re-run the differential from scratch.
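
As a rough sketch, the choice can be expressed as counting how many symptoms each cause explains and treating a tie at the top as a signal to gather more information. The names rank_causes and leading_cause, and the example data, are assumptions for illustration. A plain count also ignores how prominent each symptom is, which in practice you would weigh by judgement.

```python
def rank_causes(explained):
    """Sort causes by how many symptoms each explains, most first."""
    return sorted(explained.items(), key=lambda item: len(item[1]), reverse=True)


def leading_cause(explained):
    """Return the best-supported cause, or None when the top causes tie."""
    ranked = rank_causes(explained)
    if not ranked:
        return None
    if len(ranked) > 1 and len(ranked[0][1]) == len(ranked[1][1]):
        return None  # no clear winner: gather more information and re-run
    return ranked[0][0]


# Hypothetical cause -> symptoms table left over after step 3.
explained = {
    "network speed": ["slow response", "timeouts in the logs"],
    "slow backend database": [
        "slow response",
        "timeouts in the logs",
        "high query latency",
    ],
    "security handshake": ["slow response"],
}

winner = leading_cause(explained)
if winner is None:
    print("No clear winner; decide what new information you need.")
else:
    print("Investigate first:", winner)
```

When leading_cause returns None, that is the cue to add the new findings to the symptom list and run the differential again.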

Even when you have all the information available to you, how you decide likelihood often depends on your situation. Some symptoms might be red herrings, some might be intermittent. Either way, listing them and reasoning about which symptoms and causes you are focused on right now helps steer your thinking as you find your way to resolving the bug.

But remember… it’s never Lupus…
