It's annoying to perform the exact same set of steps 4 times, have it work 3 of the times, and fail once. When Murphy has his way, it works when you, perform the steps or are watching, but fails when the user does it on their own. How do you diagnose a bug that you can not reliably reproduce?
The majority of library bugs aren't actually bugs in the library, but rather a bug in how you are using the library. Either way, having cryptic error messages coming out of the bowels of a library, when there is no apparent connection between the error and your code, is not fun. I find the process of googling error messages or library usages to be much more frustrating than tracking down errors that are entirely in my own source code.
While we strive to have our development environment similar to the production environment, there are certain discrepancies that always creep in. Things like debug information, optimization level, database source, etc. When in development, you don't want to be messing with production data, and your willing to give up some performance to better track the code. But since it isn't the same, code that works fine in development doesn't always work in production. Ugh.
So what prompted this post? Well, this week I had a bug that was at the intersection of these bugs. We had deployed a new version of our application, and suddenly I started getting server emails about errors. It was some cryptic error happening within the library we use for making Web Service calls. And, of course, the errors only happened sometimes. Even more frustrating, when trying to reproduce the error on my local box using WEBrick in development mode, the error never happened.
After setting up a development server on my local box that was configured just like the production box, I was able to intermittently reproduce the error. Using my standard debugging technique of adding printfs, I eventually noticed output that looked like:
Starting Web Service Call 1 Starting Web Service Call 2 Exception Caught from Web Service Call 2 Ending Web Service Call 1
Unfortunately, it didn't catch my eye right away, but I eventually noticed that every time there was an error, the web service calls were getting interleaved. i.e. Web Service Call 2 was starting before Web Service Call 1 started. And now it suddenly made sense.
The two web service calls were being made as the result of two separate AJAX callbacks from a web page. Since the the two web service calls were being done in two different web requests, they were being handled in parallel resulting in a race condition. As it turns out, the web service library we were using is not thread safe. However, apparently WEBrick in development mode was serializing these requests (i.e. not processing 2 until 1 was completely handled), and so in development there were no threading issues.
As with most bugs, once the problem had been diagnosed, it wasn't that hard to fix. In this case, we put the web service calls in a critical section, forcing them to be serialized. For our particular use case, the higher latency of serializing the web service calls was acceptable, allowing for a fairly simple solution.
This post wasn't really intended to have a moral. It was mostly just me talking about my week, but I suppose there are a couple good ideas.
- If you think its a library bug, you are probably misusing the library. (i.e. Learn your libraries)
- Intermittent errors are indicative of threading issues.
- Know the differences between your development and production environments.
oh, and maybe most importantly of all
- When frustrated, take a break. I didn't actually track down the bug until I walked away from it for a while and then came back to it with fresh eyes.