Resources

4 ways developers fix hard-to-reproduce bugs in database engines

In a previous post, we looked at why debugging databases is hard from the concurrency, lack of access to data, to the default lack of logs in production. Building on this, we now look at the methods employed by developers of database software to solve those hard-to-reproduce bugs by looking at some examples from open source databases including including MySQL, Postgres, SQLite and MongoDB.

To recap our previous post, there are a few big reasons why database development is particularly prone to hard to reproduce bugs:

  • Lots of concurrency, all the time
  • You can't see the production data and possibly not even the production database structure
  • Crashes might not give an output and production systems often lack logs
  • Different environments elicit different behavior

But it’s not enough to say that bugs are difficult, they still have to be fixed. And so engineers working on database engines have to find ways forward; so what do they do when faced with these difficult bugs?

4 current debugging tactics used by software developers. Are they effective? 

Ask for more information

As a database vendor, you want people to be on the latest stable or relevant version to them and you really want upgrades to be a happy thing, bringing bug fixes and new features. But occasionally, an upgrade introduces new behaviors.

One that we found was a CPU spike just after an operation system and database upgrade. The issue was pretty weird - a query which worked before, didn't work after and caused the CPU to spike to 100%.

And, of course, it couldn't be reproduced.

This kind of bug causes headaches on both sides: the production engineers (customers) are left wondering if upgrades are a good thing, while the database engine team know that something is wrong but can't reproduce it nor fix it.

This was left hanging with a request for create tables and insert commands to reproduce the effect, but was probably fixed in another thread somewhere.

So, the first technique we found was: don't fix it. Ask for more information and wait. Whether this is a best practice or not will create some fiery debate, but the point here is that this is a method engineers are using. You have to ask yourself "is this the best way?"
 

Voodoo fixes and reading between the lines

In another example, updates to the database somehow prevented the engine from being started automatically causing the database to shutdown. The report suggested that the database team put some basic tests in before deploying updates to the package. It's hard to read whether this is genuine advice or being deliberately patronizing!

Either way, the team behind this wasn't able to reproduce the error, but they did find a clue. After the upgrade, they thought it could be a timing issue where the running database hadn't finished and so the new version couldn't start up.

So, based on reading between the lines (or guessing, if you want to be less charitable), they put in a bit of a delay.

These become what you can think of as voodoo fixes: fixes which might work or might not, but it's impossible to test.

Stop the symptom, but still not sure of the cause

In another case, the database was deployed as part of another open source project in which a bug was reported  causing an infinite recursion when doing an `insert or replace`. Neither the open source project nor the database team could reproduce the bug, but they did know that it was an infinite recursion.

The solution was to avoid the recursion entirely which fixed the symptom but not the cause. The difference between this and the previous issue is that you can test the symptom (doing an insert or replace) and the bug is fixed, but we don’t know why the program ended up in an infinite recursion.

Similar to the previous fix, the engineer used what she had and put in place something which fixed it but this time by avoiding the problem. That should have closed the other project’s issue but it didn’t. The commit went into the database source, but this didn't satisfy everyone; without being able to reproduce the specific bug, they were nervous about just putting a fix into their project.

This raises yet another interesting issue: if your database product is part of a larger 'parent' product, then it might not be your role to determine if the bug is fixed or not. For hard-to-reproduce bugs, this is going to be a drain on resources from both sides and increasingly frustrating.

The fix in this case fixed the bug as seen in the error logs, but there's no way of knowing what the effect of the fix is downstream for the user.

 

This fixes is a bug, but perhaps not the reported bug

In another example, a bug is fixed with the patch, but it's not clear whether the bug that's been fixed is the bug that was reported. In this example, the engineer has been led to a genuine bug but the original might still be present.

The commit message admits that this is a 'blind fix' so, if you were signing off the bug, would you deem it fixed or still open?

 

What we've learned

We knew from our previous research that databases are sources of hard-to-reproduce bugs and, because they store the state of all software, they are a valuable thing to keep right.

Knowing the main causes of challenging bugs in the development of data management systems, we wanted to better understand what engineers do to fix them?

Sometimes, engineers get to the underlying cause of the issue, but this isn't always the case; and so, while our research was anecdotal, it seems that the main method of fixing these is...guesswork.

The main question we ask as we look at these bugs and fixes is: are these bugs actually fixed? Or will they come back?