Monday, October 31, 2011

Chasing Technology

Maybe I'm just getting old, but keeping up with technology seems to be getting harder and harder. As I've mentioned before, I am a big fan of learning your libraries, languages, and tools. But how are you supposed to do that, when the libraries keep upgrading and there are constantly new tools to use? Just this past week there was a post on slashdot suggesting that tech skills have a 2 year half life. The comments on that thread seem to mostly disagree with the hypothesis, but there is definitely some truth to it.

The two languages that I use the most at work did not exist when I started college. However, languages are easy to keep up with. They tend to move slowly and are very well documented. Tools, libraries, and frameworks, on the other hand, are a nightmare to chase. They are constantly being upgraded, and often times the best documentation are the question/answers out there on the web at places like stackoverflow. The problem with these sites is that they are only useful after a tool has been out long enough for a critical mass of people to use.

So what's a person supposed to do?

Well, first, as many commenters from the slashdot article stated, the most valuable skills a software engineer has is knowing how to design, develop, and debug software. These skills don't go away just because you are using a new language or library. So play to your strengths. If you design and write clean code, it'll be easier to fix/update as you learn the language/library/tool better.

Second, choose your upgrades carefully and intentionally. Just because there is a newer version of something out there, doesn't mean you need to use it. Are the benefits worth it to upgrade now? Does it make more sense to wait until it has been popularly adopted? Or maybe it makes sense to skip a version or two.

Third, don't become an expert in everything. Most of us have finite amounts of energy and time. So, while I'm a big believer in being an expert in the tools you are using, that isn't always practical. If a particular tool/library/framework is bound to be upgraded/replaced in the near future, it's probably ok if you use the existing library in a less than optimal way. Focus your energies on the timeless (design and coding skills) or the long lasting (languages).

constantly learnFourth, and most importantly, constantly learn. If your employer doesn't provide opportunities for professional growth, take it anyway. Just as you don't ask for permission for every single test that you write, or every whiteboard design that you draw, you don't need permission to explore new technologies and tools. Whether they know it or not, your employer pays you to be a good and competent software developer. This requires you to be constantly learning new and better ways of doing things. If you are not taking ten to twenty percent of your time exploring and learning, you are shortchanging yourself and making yourself obsolete. I realize that it seems there is never time to do this, but like other aspects of good software engineering (appropriate design, testing, documentation), in the long run you will be worse off if you don't make the investment now.

Monday, October 10, 2011

Vexing Bugs

While bugs are a part of development, there are a few types of bugs that I find particularly vexing: intermittent bugs, library bugs, and bugs that only happen in production.

Intermittent Bugs
It's annoying to perform the exact same set of steps 4 times, have it work 3 of the times, and fail once. When Murphy has his way, it works when you, perform the steps or are watching, but fails when the user does it on their own. How do you diagnose a bug that you can not reliably reproduce?

Library Bugs
The majority of library bugs aren't actually bugs in the library, but rather a bug in how you are using the library. Either way, having cryptic error messages coming out of the bowels of a library, when there is no apparent connection between the error and your code, is not fun. I find the process of googling error messages or library usages to be much more frustrating than tracking down errors that are entirely in my own source code.

Production Bugs
While we strive to have our development environment similar to the production environment, there are certain discrepancies that always creep in. Things like debug information, optimization level, database source, etc. When in development, you don't want to be messing with production data, and your willing to give up some performance to better track the code. But since it isn't the same, code that works fine in development doesn't always work in production. Ugh.

This Week
So what prompted this post? Well, this week I had a bug that was at the intersection of these bugs. We had deployed a new version of our application, and suddenly I started getting server emails about errors. It was some cryptic error happening within the library we use for making Web Service calls. And, of course, the errors only happened sometimes. Even more frustrating, when trying to reproduce the error on my local box using WEBrick in development mode, the error never happened.

After setting up a development server on my local box that was configured just like the production box, I was able to intermittently reproduce the error. Using my standard debugging technique of adding printfs, I eventually noticed output that looked like:

Starting Web Service Call 1
Starting Web Service Call 2
Exception Caught from Web Service Call 2
Ending Web Service Call 1

Unfortunately, it didn't catch my eye right away, but I eventually noticed that every time there was an error, the web service calls were getting interleaved. i.e. Web Service Call 2 was starting before Web Service Call 1 started. And now it suddenly made sense. 

The two web service calls were being made as the result of two separate AJAX callbacks from a web page. Since the the two web service calls were being done in two different web requests, they were being handled in parallel resulting in a race condition. As it turns out, the web service library we were using is not thread safe. However, apparently WEBrick in development mode was serializing these requests (i.e. not processing 2 until 1 was completely handled), and so in development there were no threading issues. 

As with most bugs, once the problem had been diagnosed, it wasn't that hard to fix. In this case, we put the web service calls in a critical section, forcing them to be serialized. For our particular use case, the higher latency of serializing the web service calls was acceptable, allowing for a fairly simple solution.

This post wasn't really intended to have a moral. It was mostly just me talking about my week, but I suppose there are a couple good ideas.
  • If you think its a library bug, you are probably misusing the library. (i.e. Learn your libraries)
  • Intermittent errors are indicative of threading issues.
  • Know the differences between your development and production environments.
    oh, and maybe most importantly of all
  • When frustrated, take a break. I didn't actually track down the bug until I walked away from it for a while and then came back to it with fresh eyes.