Databases are the new electricity - they are invisible, power everything, and we assume they'll just work.
Databases underpin the whole of the modern world by providing structured storage of huge volumes of facts. Everything from how you feel about cat videos to the balance on your trendy new app-based bank account is stored in databases. But designing and building the underlying database engines is inherently difficult and bugs can be hard or impossible to reproduce and fix.
This is attested by unnerving comments like "We can't say this is fixed for sure" in the source code history of a leading open source database which powers the world. But alongside these comments is evidence for how engineers solve or work around hard to reproduce bugs.
In this article, we look at why database engines are a source of hard-to-reproduce bugs and we propose some common strategies for solving them. To do this, we've crawled through forum posts and source code histories from open source database engines looking for the tips, tricks and tools engineers use to keep the lights on.
Lots of concurrency all the time
Databases are inherently highly-concurrent systems where many, possibly millions of users may be accessing and updating records at the same time. Many strategies have been employed to manage the trade-off between performance and consistency, but the concurrent nature remains an underlying cause of bugs.
This is because concurrency is not the friend of easy debugging and, while some developers cite using languages like Clojure to prevent mutating data and reducing complexity, this isn't always an option. Despite strategies for concurrency such as locks and thread safe code, concurrency will probably always bring hard-to-reproduce bugs.
To solve these bugs, engineers create automated tests which step through specific sets of queries to reproduce known issues. This includes creating tables, inserting data to querying and getting expected results; but even these can't capture the uncertainty and variable performance of different infrastructure and use cases.
You can't see the data
Like concurrency, lack of information is also not a friend of debugging. As the developer of the database engine, you live several steps removed from how the database is being used, including having no access to the data running on the database instance.
This means that reproducing bugs exactly as they occurred is often just not an option, as users of your engine aren't going to give you their data and even the data structure might be commercially sensitive.
The solution often adopted here is to create mini-versions of the full database and examples of the query which are similar to what happened. But there's a problem: in order to get to this, the production database engineer is going to have to investigate and translate end-user reports into reproducible steps and then into a good report for the you.
It implies that to debug a serious low-level problem in production, a production engineer has to do the investigation and put in the time to debug it, which isn’t always possible.
Worse than that, the changes put in place intended to fix the issue can’t guarantee to work. Commit logs of a certain widely used open source databases reveal voodoo fixes: It might work but "we can't say for sure".
Crashes and a lack of logs
To compound the lack of data issue, it's common practice to disable certain types of logging in production environments further limiting the information available to reproduce the issue. In this case, engineers extrapolate from trace evidence in what logs there are and from user reports, but neither provide a wealth of detail.
The strategy in these intermittent bugs which can't be reproduced sometimes lead to voodoo fixes consisting of a change which _should_ fix the issue as understood, but there's no way to know if it fixes the real root cause.
Different environments, different behavior
Even if you have the same data as in production, that doesn't mean you're going to be able to reproduce the bug.
Database engines are run in many environments from physical machines to virtualized environments in public and private cloud, and embedded. Beyond this, the machines, networks, libraries and environments will differ in almost every deployment.
These differences create yet another underlying cause of hard-to-reproduce bugs and debugging pain for database engine developers.
Trawling support forums and commit logs reveals the story of the same strategy tried time and again to work out what's going on in production: get yet more detail of the production environment in the hope of reproducing the issue in development.
After an exchange of ideas and trying things out, you can find evidence of inconsistent environments causing problems coming from I/O performance, network performance and random number generators. Some of this comes across as guessing, albeit educated guessing.
Debugging databases is hard
We've seen a handful of the strategies engineers use to fix hard-to-reproduce bugs on database engines: recreating mini-versions of the production database, reading as much as possible into each log entry, replicating the production environment...Each of these strategies are essentially about getting more information.
An efficient way to obtain a lot of that information is by using software flight recording technology. This technique allows you to record database failures ‘in the act’ (either in Dev, Test, or Production). The recording serves as a stand-alone reproducible test case which can be replayed in a reversible debugger to step backwards or forwards through the code and see the contents of any memory location at any point in the program's execution history. No need to spend weeks – or worse fail – to replicate the issue: you record once, and get straight to debugging by replaying the recording as many times as you like.