
How Do You Fix An Impossible Bug?

Within the span of an hour, it had all gone to hell.

The first deployment went rather smoothly. It was a fix to an existing web service, and went out with no problems, or so we thought. Within ten minutes of the deployment, the users started complaining of a minor bug, one that was seemingly omnipresent but didn't really stop them from doing meaningful work. The team which had sent out the deployment immediately set to work figuring out what was going on.

Forty minutes later, and unrelated to that deployment, my team launched a major change to a web site that consumed that other team's web service. When our change went out, the bug the users had been complaining about from the web service deployment vanished, replaced by a major bug that prevented them from doing any work at all. Naturally, we were a little concerned.

The major bug caused an error message that included this statement:

"Unknown server tag 'asp:ScriptManager'."

Now, usually when I see an error message like that, my first thought is that the project is missing a reference. But no, the references the project needs do, in fact, exist. My second thought is that it's targeting the wrong version of .NET (in this case, 2.0). Wrong again; it's using the proper version. So now I'm a bit stumped. I pull one of my developers off his project to work on this, and he and I go to our manager and the web service team to try to hash this out.

It took the five of us about an hour to work out where exactly the problems were. As so often happens with major problems like this, it wasn't one problem but several, all intersecting at one time. They snowballed like so:

  • The previous day, my group had attempted to roll out the major change to the web site. The rollout didn't work, and another manager (who had previously owned the project) immediately, and unbeknownst to me, copied the application to a different location on the same server. He figured this would solve the problem, as it had before with a different app; it didn't.
  • Before the web service change went out, the users had already been notified to use the new location. Consequently they started complaining about the major error.
  • When the web service change was deployed, a different set of users within the same group complained about the minor bug, as word had not reached them to use the new location.
  • When our web site change went out (to the original location), the users of that site noticed and now complained of a "new" bug at the old location, even though it was the same bug that existed at the new location.
  • Taken all together, this meant that the fact that our web site and their web service were related was pure coincidence. The two deployments going out so close to each other had nothing to do with the actual problem. Coincidences like this are the worst possible thing that can happen when you're trying to find a bug.

Got all that?

Ultimately we worked out the problem. Well, really, we stumbled onto it. How we got there was such blind luck that I'm not convinced we actually solved the problem so much as lucked into a solution.

A bit of googling for the error message revealed this Stack Overflow answer, which states that the above error appears when a piece of the web.config is missing. Here's what that section should look like:

<pages>  
    ...
    <controls>
        <add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=3.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
    </controls>
</pages>  

We had previously confirmed that the line did, in fact, exist in our application configuration. Obviously this could not be the problem. (I apparently forgot what happened the last time I assumed something was true.) Later, as we grew more desperate to find the source of the problem, I had our server team give me a copy of the app as it was running on the production servers. This is what I found:

<pages>  
    ...
    <controls>
        <!--<add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=3.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>-->
    </controls>
</pages>  

Yes, you read that right. For some damn reason, the offending line of configuration code had been commented out.

And we had no idea how this happened.

We tried examining the config transforms; there was nothing that would result in that line being commented out. We tried looking at the server logs, the build logs, the source control history, anything that might give us a scrap of information as to how the hell this happened. We found nothing.

As you might imagine, this was a little frightening. That line got commented out, and we couldn't figure out how. How can you fix a bug that should never have occurred in the first place? When we deployed the corrected configuration file, it worked, of course. But in the meantime we had wasted nearly an entire day looking for something that should have been impossible.

But was it impossible, or did we miss something? I'm inclined to believe the latter. One of the things I stress to my team when they come to me with bug reports is the important question: what changed? What changed between when the system worked and when it didn't? Was it business rules, data sources, the build process? If we can determine what changed, the time needed to pinpoint the actual bug shrinks dramatically. In this case, either we simply didn't know what had changed (the most likely scenario) or nothing had changed (the far scarier scenario). Either way, something was off and we couldn't determine what it was.

What was even more worrisome was that there had been a minor bug reported before the major bug showed up, the one that was annoying but not work-stopping. That minor bug was not reproducible now, so it's entirely possible it's still lurking out there, waiting for some unsuspecting user to click the wrong button or enter the wrong date and then blow up the whole system. We don't think it will be that serious, but we also don't really know that it won't. We're stuck in bug-induced purgatory.

That's a terrible feeling. Something went wrong, and you can't figure out how. You know what the solution is, but you don't know why.

I suppose I should just be happy we figured out the problem. I am, sort of. And yet I am forced to conclude that there exists a bug which caused a critical part of our application configuration to be commented out for no discernible reason. A part of me still wonders: how can I find and fix that bug, a bug that should have never existed in the first place?

And what if it happens again?

What about you, dear readers? What are some bugs you found that you couldn't source? How did you fix these "impossible" bugs? Share in the comments!

Happy Debugging!

Post image is 2008 broken computer monitor from Wikimedia, used under license.

Ancient ORACLE, Modern IIS, and a Failure to RTFM

My coworkers Rajit and Dave and I have been spending an inordinate amount of time over the last few days looking for a really annoying error in one of our apps. The exception message reads like this:

Could not load file or assembly 'Oracle.DataAccess' or one of its dependencies. An attempt was made to load a program with an incorrect format.

Yep, that's right, we're using ORACLE as our database. I assure you, this is not by choice. We've known for a while that our version of ORACLE is so old that it can remember when it played with two sticks and a rock (and was happy, dammit). Problem was, we assumed that it had to be causing this issue, though we ultimately determined that it wasn't responsible. Even better, in the process of squashing this bug we discovered that someone had probably found the fix for this issue already and utterly failed to inform us about it.

Sun Temple With Ancient Artwork, used under license

Here's the situation: the app we're building is a rewrite of a much-loved internal app for our company, using a more modern tech stack: MVC5, Dapper, etc., all of which connect to an ORACLE database which, for reasons too convoluted to discuss here, we cannot refactor or remove. On our local machines and on our "Beta" server (which is where we have our QA department do their tests), the app works fine. It connects to the ORACLE database, runs queries, returns data, the works. But when we move the app to our staging server (which runs IIS 7 and connects to production databases to allow us to do sanity-checks before deployment), the app completely loses its mind and throws the above error.

Rajit, Dave, and I figured that the issue had to be either something to do with the app pool settings or some stupid coding mistake having to do with the ancient ORACLE DLLs. So we and the server admins started poring over the app's code and the IIS setup, scanning for some simple clue as to how to drive out the insanity that ancient ORACLE and modern IIS had inflicted upon us.

And we found nothing. At least, nothing obvious.

So, in desperation, we resorted to turning things off to see if that fixed the problem. That led to us ultimately finding, after a day and a half of searching, the very setting that had been causing all of the strife we'd been experiencing on this project. This bug was so subtle, so innocuous, that none of us thought to look there. In fact, all of us presumed that it couldn't even be the issue. We prematurely concluded that the newest, shiniest thing (IIS 7) was less likely to be the cause of the problem than primitive, creaky ORACLE.

See where this is going?

After much frustration, we eventually located the cause of the bug; it was a setting in the IIS app pool called "Enable 32-bit Applications". In other words, the source of all of our grief for 36 hours was this thing:

A screenshot of an app pool's settings dialog in IIS, showing the "Enable 32-bit Applications" setting.

Here, let me make that a little more obvious.

A screenshot of an app pool's settings dialog in IIS, with the "Enable 32-bit Applications" setting highlighted.

Perhaps I should explain.

ORACLE cares about bitness: its 32-bit DLLs can only be loaded by a 32-bit process, and its 64-bit DLLs only by a 64-bit process. We figured that we were in the clear, though, since the server running our app has a 64-bit operating system and the app itself is using the 64-bit ORACLE DLLs.

(As a side note, ORACLE caring about bitness was the reason I wrote Using 64-bit IIS Express in Visual Studio, since everyone on my team needed to flip that switch in order to be able to debug the app locally.)
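Related tip: .NET can tell you directly which bitness a given process is actually running under. Here's a minimal diagnostic sketch; the class name is made up, but Environment.Is64BitOperatingSystem and Environment.Is64BitProcess are real, built-in properties.

using System;

class BitnessCheck
{
    static void Main()
    {
        // Both properties have been part of the framework since .NET 4.0.
        Console.WriteLine("64-bit OS:      " + Environment.Is64BitOperatingSystem);
        Console.WriteLine("64-bit process: " + Environment.Is64BitProcess);

        // A 64-bit OS does not guarantee a 64-bit process. If the process is
        // 32-bit, the 64-bit Oracle.DataAccess assembly cannot be loaded into
        // it, and you get the "incorrect format" error above.
        if (Environment.Is64BitOperatingSystem && !Environment.Is64BitProcess)
        {
            Console.WriteLine("Warning: this is a 32-bit process on a 64-bit OS.");
        }
    }
}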

Did you catch what the description of that "Enable 32-bit Applications" setting actually said, though?

If set to true for an application pool on a 64-bit operating system, the worker process(es) serving the application pool will be in WOW64 (Windows on Windows64) mode. Processes in WOW64 mode are 32-bit processes that load only 32-bit applications.

Well, crap. Maybe if we'd slowed down a bit and RTFM'd, we'd have saved ourselves some headache.

This bug was, of course, reproducible in both the "Beta" and production environments. The reason our app didn't immediately fail on our "Beta" server was that, on said server, the application was in its own app pool with "Enable 32-bit Applications" disabled. Apparently some idiot already ran into this issue months ago and forgot to tell us about it, seeing as we couldn't come up with another reason why the app pool was set up this way (and, y'know, the server logs told us as much). Regardless, once we set up the application on production in its own app pool with the setting disabled, it worked like a charm.

So, yeah, I hate this stupid-ancient version of ORACLE we're forced to use (even though it wasn't the cause of the problem). It's old, it's slow, and it requires us to care about whether the server it's running on has a 32-bit or 64-bit architecture. It's tacky and I hate it.

But mostly, I hate that there is someone out there, in our own company, who took the time to read the friggin' manual, figure this nightmare out, and then proceeded not to leave a note informing us devs that this dumb little IIS setting would be the root cause of this infuriating bug. If you've solved a problem that others might encounter, and the solution was non-obvious, can't you leave a comment, a sign, a carrier pigeon, something to tell the people who come after you about it? Is it really that difficult to do?

That guy? Yeah, I hate that guy the most.

Of course, that guy is me. So say the server logs. Take that, me!

The Bug Is In Your Code

Have you ever been hunting a bug and been absolutely sure that it was in someone else's code, only to find out that, nope, it was in yours all along? I sure have. Come along with me as we explore my latest minor failure and remind ourselves that, most of the time, the bug is in your code.

An Emperor Gum Moth, a large, brown, hairy insect, seen in flight. (Emperor Gum Moth from Wikimedia, used under license.)

Well, hopefully that bug isn't in your code, but you never know.

Bad Assumptions, Bad Actions

We inherited a new application, a chat system built using SignalR and .NET 4.5, which we needed to deploy to a brand-new IIS8 server box. The system consists of two pieces: a hub, which acts as the "server" for the broadcast system, and the client web application, which receives broadcasts. At the time we inherited the system, the hub had already been deployed; we just needed to get the client app on the correct server and we'd be good to go.

The deployment itself went off without a hitch; the web app was accessible to the right people, let us log in, basically worked. Or so it seemed.

As part of sanity-checking the deployment, we programmers logged in and tried to create conversations and send messages through the chat system. We could create messages just fine, but after the first message was sent, no further messages could be transmitted. A quick debugging session showed that the SignalR piece on the web client was failing at some point while trying to reach the hub.

We checked and double-checked the web client settings and couldn't find anything that might cause this error. That's when we went to the server team, pointed out our problem, and asked if they could help us figure out what exactly was going on. Clearly this was a server or connection issue; the app was deployed onto a clean IIS8 machine which might not have gotten the correct settings applied, and SignalR was failing to connect. That means network problems, right?

Except, the server team couldn't find anything wrong. They did whatever server magic they do (I'm SO not a server guy) and didn't detect any problems. Yet our debugging suggested that the error was in the network; possibly permissions or access-related. For a few days we went back-and-forth over this, arguing and haggling and posturing until we were so bloody tired of even talking to the other side.

I went back to try to debug this again, still flustered because I was convinced that this was a network problem, but the network team couldn't find any explanations for why we might be getting this error. Come on, network guys, it's obviously some kind of network issue, and you can't even find it?! This can't be my fault! Bewildered (and kinda angry) at their lack of insight, I stepped through the web client code once more.

It CAN Be Your Fault

It was during this bug hunting session that I decided to re-read the actual text of the connection error message I was seeing:

SignalR: chathub.[MethodName] failed to execute. Error: the underlying provider failed on Open.

Hmmm, I thought, that error only appears after initiating a hub request; seems to take about a minute to appear. So it clearly is hitting the hub, it's just not getting any response back. Now that I'm thinking about it, I've often seen that "underlying provider failed on Open" error when something fails to connect to a database... Crap.

It was like a lightbulb came on in my head.

The Centennial Light, which has been on continuously since 1901

I quickly started rummaging through the hub settings (remember that it was deployed before we took ownership of this application). Sure enough, the profile of the hub that had been deployed to production was using a connection string that connected it to our staging database server, not the production one. With the wrong connection string, the hub couldn't connect to the database (since in our company, production servers cannot contact staging ones, and vice-versa), hence it was throwing that error. We could create conversations because that happened in the web app, which had the correct production connection string, but sending messages after creation of the conversation required the hub to work.
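To make the shape of the failure concrete, here's a rough sketch of a hub method like the one that was failing. The class, method, and entity names are invented (the real app's code looks different), but the moving parts are the standard Microsoft.AspNet.SignalR Hub base class and an Entity Framework DbContext, which is where "The underlying provider failed on Open" comes from when the hub's connection string points at a database the server can't reach.

using System.Data.Entity;
using Microsoft.AspNet.SignalR;

// Hypothetical EF entity and context, just to make the sketch self-contained.
public class ChatMessage
{
    public int Id { get; set; }
    public int ConversationId { get; set; }
    public string Text { get; set; }
}

public class ChatContext : DbContext
{
    // By convention, this reads the "ChatContext" connection string from the
    // hub's own web.config, not from the web client's.
    public DbSet<ChatMessage> Messages { get; set; }
}

public class ChatHub : Hub
{
    public void SendMessage(int conversationId, string text)
    {
        // If the hub's connection string points at a server it can't reach
        // (our staging database, in this case), Entity Framework throws
        // "The underlying provider failed on Open" right here, and the client
        // sees "chathub.SendMessage failed to execute".
        using (var db = new ChatContext())
        {
            db.Messages.Add(new ChatMessage { ConversationId = conversationId, Text = text });
            db.SaveChanges();
        }

        // Clients never see this broadcast, because the exception above fires first.
        Clients.All.messageReceived(conversationId, text);
    }
}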

All of this, of course, meant that the error was not the server team's fault. It was mine, as my team (including me as the lead developer) was responsible for this app now. I didn't deploy the hub, and I didn't write it, but this bug was my responsibility, and I failed to research it enough.

This is what happens when we assume the solution and don't take the time to do our own thorough research. I was so certain that the hub had been deployed correctly and that our settings were correct that the only remaining possible issue, to my mind, was the network settings. After reaching that erroneous conclusion, I became blinded to the possibility that the bug was still in my code.

Of course, it was in my code, and of course, I blamed other people for what was ultimately my responsibility. Not exactly my finest moment, I'll admit.

This is yet more proof that we're all (especially me) not as smart as we think we are. Most of the time, the bug is in our code. Not in a library, not in some other person's service, not in the network, not "somewhere else". In our code.

Jeff Atwood put this another way: he wrote that the first rule of programming is that it is always your fault. If you assume that the problem exists in code you are responsible for, you are more likely to actually locate the bug (because most of the time, it genuinely is there). Even if you don't locate the bug, you've learned enough about the system in the process that you should have a much better idea of where it might exist.

Assume that the bug is in your code! Only after you've proven that it isn't can you start to explore other possibilities.

We've all seen how my forgetting to check my code thoroughly ended up making me look like an idiot. Has this ever happened to you, or even some "friend" that you know (wink wink)? Share in the comments!

The Bug Hunt From Hell (Or, Why You Shouldn't Forget About Caching)

We have an in-house media server that stores internal images, videos, etc. This project has many components, but the major ones are:

  • A database, where information about the media items is stored.
  • A service layer, which allows multiple apps to use the server and database.
  • A private site (which talks to the database through the service layer) that allows certain individuals to upload media items.

With lots of pieces, something was bound to break eventually. This week it did, and the resulting bug hunt almost had me convinced that I'd gone insane.

A sign reading "Insanity". (Occupy Sign by ERAGON, used under license.)

The Media Server Setup

For most of the past three years, the Media Server has been running along very smoothly, with no really big issues. Lately, though, we've been getting some complaints about this application, specifically that deleting items from the server didn't actually delete the items; they were still visible to the user that had uploaded them.

I assigned a task to my coworker Diana to investigate this issue. Because I was the most recent person to actually work on Media Server, I told her to come to me with any questions she had in the process of tracking down this bug.

In regular usage of the Media Server's private upload site, the flow goes like this: when you first log into the site, you are presented with a list of media items that you uploaded ("My Uploaded Items"). You can then apply filters (like videos only, or images only) and sorting, and each of these invokes a specialized query for that scenario.
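To give a rough idea of the shape of the service layer, here's a sketch. Every name below is invented for illustration (the real service has different names and many more methods); the important detail is that each filter invokes its own method, and therefore its own specialized query, on the service.

using System.Collections.Generic;

// Illustrative only; not the real Media Server code.
public class MediaItem
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string MediaType { get; set; }   // "video", "image", etc.
}

public interface IMediaService
{
    // "My Uploaded Items": everything the current user has uploaded.
    List<MediaItem> GetUploadedItems(int uploaderId);

    // Each filter gets its own specialized query.
    List<MediaItem> GetUploadedVideos(int uploaderId);
    List<MediaItem> GetUploadedImages(int uploaderId);
}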

Here's what Diana did to try to find the bug.

  1. First, she ran the query showing all of her uploaded items. This did not show the video she was testing with (which is what we expected, since she had not uploaded it yet).
  2. Next, she uploaded the test video.
  3. The uploaded video did not appear on her "My Uploaded Items" view. She switched the filter to "only videos" and the newly-uploaded video now appeared.
  4. She deleted the video.
  5. When she looked at the "My Uploaded Items" view the now-deleted video did not appear (which, by itself, is the correct behavior).
  6. But when she switched the filter to "only videos" the deleted video was still being shown.

So, was the video deleted or not? We spent a little over an hour trying to work that out.

The Bug Hunt From Hell

We started out doing the normal debugging things: double-checking settings, testing the connection strings, generally figuring out if we'd missed anything obvious. When that failed to locate the bug, we moved on to increasingly desperate ideas. Eventually, one of us figured out that this particular bug only made sense if it existed in the service layer; the database layer had no issues when we tested it separately from the entire stack, and the private site couldn't logically cause the bug since it only read the data it received from the service, so the only thing left was to test the service itself.

Sure enough, our testing revealed that on Diana's machine, the bug was entirely reproducible. The video that she had deleted would still appear in her "only videos" filtered query when running the Admin app locally. Being the good testers we are, we pulled up WCFTestClient and started hitting the service layer that way.

When we tested the "only videos" query through WCFTestClient, the result set did not include the deleted video. So if we ran Diana's instance locally, we saw the deleted video (which is a bug), but if we executed the same service call through WCFTestClient, we did not see the video (the proper execution). But it was the same damn method, with identical inputs. We were executing the same call twice and getting different results; isn't that the definition of insanity?

As I am not (as far as I know) actually insane, I was flabbergasted. Desperate for some insight, I started combing through the code, line by line. Eventually, we came to the service class (which is an older ASMX service), where we encountered this lovely little snippet:

[WebMethod(CacheDuration=86400)]
public List<Items> MethodName(...)  
{ }

For those of you who may have never seen that attribute before, CacheDuration in this case is setting the results of this call to be cached for 86400 seconds, or 24 hours.

Dissecting The Bug

Here was our problem. Since the caching was set to such a long duration, the first time Diana ran the "My Uploaded Items" query it did not show the testing video, since it did not exist. Then, when she uploaded said video, "My Uploaded Items" would still not show the new item because the results of that query were cached, and those results did not include the new video.

Further, the "only videos" query was run after the video was uploaded, but before the video was deleted. Consequently, the video continued to show in that query because the results were cached.

A check of the deployment logs showed a recent deployment of the media service, which lined up with when the complaints about videos not deleting started showing up. Sure enough, that deployment included this cache setting. The reason the cache was set to last so long is that we have a lot of reads from other, external services hitting that method, and for performance reasons we didn't want to run that query every time we needed the same set of videos. The real solution here is to change the architecture so that the upload site isn't using the same methods and caching rules as the external services, but for now, setting the cache down to 15 minutes seems to be a reasonable compromise.
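Here's a minimal sketch of that compromise, again with invented names (the real service and its methods are called something else): the heavily-read method keeps its output caching, just at 15 minutes instead of 24 hours, and the eventual architectural fix is to give the upload site a path that bypasses the cache entirely.

using System.Collections.Generic;
using System.Web.Services;

// Hypothetical type, along the lines of the earlier sketch.
public class MediaItem
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public class MediaService : WebService
{
    // External consumers still get cached results, but a stale entry now lives
    // for at most 900 seconds (15 minutes) instead of 86400 (24 hours).
    [WebMethod(CacheDuration = 900)]
    public List<MediaItem> GetUploadedVideos(int uploaderId)
    {
        return QueryUploadedVideos(uploaderId);
    }

    // An uncached path for the upload site, so uploads and deletes show up
    // immediately, regardless of what external consumers have cached.
    [WebMethod]
    public List<MediaItem> GetUploadedVideosUncached(int uploaderId)
    {
        return QueryUploadedVideos(uploaderId);
    }

    private List<MediaItem> QueryUploadedVideos(int uploaderId)
    {
        // The actual database query lives here; omitted in this sketch.
        return new List<MediaItem>();
    }
}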

Anyway, that's my bug hunt from hell story of the week. Now I'll have to remember to check for crazy cache settings as part of my debugging process. What kinds of bugs have you run into? Did any of them result in a crazy hunt? Share in the comments!