stories

Posts involving characters, whether original fiction or otherwise.

Solving the Stupid Bug

A few weeks back I wrote about a very unusual bug that returned 500 errors for routes that should have been valid. Recently my team and I, to our great relief, discovered the cause of this bug. Of course, that wasn't before we ended up yelling "WTF?!" a great many times.

A set of letter cubes from the game Boggle, spelling out "WTF".

A great many times.

A New Failing Route

Most of our collective WTFs occurred when we discovered the original problem. As I wrote about in the earlier post, we have an ASP.NET Web API application in which the routes /auth and /photos had been returning 500 errors whenever we tried to hit them. The odd thing was, the 500 errors always got returned immediately, even when there should have been a compilation delay. (Go read the whole post if you want the complete backstory.)

We implemented a temporary fix by changing the routes for /auth and /photos to /authentication and /pictures, but that obviously is not a good long-term solution for something that should never have happened in the first place. (Not that I have a great track record with things that should have never happened in the first place, but hey, I'm an optimist.)

And then my coworker Diya noticed something that stunned us: now a new route was failing. The route /phones was also giving us the dreaded no-compilation-delay 500 errors.

Diya spent a day proving that /phones had the same issue that the other routes did, just like Nate and I had done a few weeks ago. And, sure enough, she came to the same conclusion: the route never responded successfully to any request made to it.

But this didn't make sense to me. How could a route, which had previously been working, suddenly start returning 500 errors? WTF?

Determining the Root Cause

I'd had just about enough of this. If our API's routes were going to randomly start failing for no damn reason, I needed to know why. I pestered our server team (which, admittedly, usually has bigger fish to fry than tracking down some fiddly route problem on a dev server) until they took a look at the server which hosted our API.

When they finally did look through the server, they told me that several other web applications existed there. They also sent me this screenshot:

A screenshot of the IIS application our API was running in.

In case this wasn't clear enough, let me make the exact problem a bit more obvious:

The same screenshot from IIS, with the offending three apps highlighted.

The server which hosted our API had empty web apps on it that corresponded exactly to the failing routes! So, whenever we tried to hit /auth or /phones or /photos, IIS would redirect that request to the corresponding app, and because that app wouldn't actually work, we'd get back 500 errors.
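
We found it by staring at that screenshot, but for the record, this kind of collision is easy to check for programmatically. Here's a rough sketch (not something we actually ran) using the Microsoft.Web.Administration API; the site name is a placeholder, and you'd need admin rights on the box:

// A sketch only: enumerate the IIS applications under the site and flag any
// whose path matches one of our Web API route prefixes.
// Requires the Microsoft.Web.Administration assembly and admin rights;
// "Default Web Site" is a placeholder for our actual site name.
using System;
using System.Linq;
using Microsoft.Web.Administration;

class RouteConflictCheck
{
    static void Main()
    {
        var apiPrefixes = new[] { "/auth", "/photos", "/phones" };

        using (var serverManager = new ServerManager())
        {
            var site = serverManager.Sites["Default Web Site"];

            foreach (var app in site.Applications)
            {
                // An IIS application mounted at /auth (etc.) swallows those requests
                // before they ever reach our API, which is exactly what bit us.
                if (apiPrefixes.Contains(app.Path, StringComparer.OrdinalIgnoreCase))
                {
                    Console.WriteLine($"Conflict: IIS app '{app.Path}' shadows an API route prefix.");
                }
            }
        }
    }
}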

Hooray! We'd found the bug, and now we could finally kill it! Except...

Here's where the real fun begins. From our perspective, the route /phones (the route that Diya was working on) started failing after /auth and /photos (the routes that Nate and I discovered didn't work a few weeks ago). That, to me, suggests that the Phones application was created at a later time. But the server team swears up and down that all three apps were created at the same time.

How Did This Happen?

So now there are three possibilities. Either:

  1. The Phones web app was in fact created later, even though there's no record of that happening OR
  2. Something in our Continuous Integration/Continuous Deployment process is automatically creating these apps (which is extraordinarily unlikely) OR
  3. We never ran any code against the /phones route in the three weeks of testing we've been doing (or, if we did, didn't notice that it failed).

Occam's Razor suggests that the answer is (3). I personally don't see how that's possible with the amount of testing we've been doing over the last several weeks, but (1) and (2) are so much more unlikely that it's almost ludicrous to decide that either was the culprit. Logic suggests that we simply didn't test the /phones route, and in other news, I now hate logic.

Summary

We finally figured out why our routes were failing: there were empty applications on the server that matched the route names. Since those applications have been removed, we've been able to reset the routes to the names we wanted in the first place (/auth, /photos, and /phones). But now we're left with a worrying possibility: was an entire route (/phones) failing for three weeks and we didn't notice?

But, hey, at least it is fixed. That's good, I think. Right?

Happy Coding!

A Stupid Bug and a Plea for Help

UPDATE (17 Apr 2017): We've fixed the stupid bug described in this post! Check it out!

Bug hunts can take on many forms. Some are simple cleanup jobs, some take hours and end up being something stupid, and some look rather easy and turn out to be impossible. My group recently encountered a fourth kind, a bug where the solution is found but makes no sense, and we could use some help figuring out why this bug exists.

Birth of a (Stupid) Bug

Let me set the problem up. I had an ASP.NET Web API controller that looked something like this:

[RoutePrefix("auth")]
public class AuthController : ApiController  
{
    [HttpPost]
    [Route("login")]
    public IHttpActionResult Login(LoginUserRequest request)
    {
        //...
    }

    [HttpGet]
    [Route("testOracle")]
    public IHttpActionResult TestOracle()
    {
        //...
    }
}

Naturally, this results in URLs that look like these:

  • /auth/login
  • /auth/testOracle

The bug was that any time I attempted to execute the /auth/login action, I got a response of 500 Internal Server Error. For those of you who might not work in the web world, that status message is about as helpful as a tent in a hurricane; it's a generic error message used when we don't know what actually happened (which is to say, it's used almost all the time). Worse, it didn't matter what inputs I POSTed; I always got the 500 error.

Now, it's not really a secret around here that I hate ORACLE, though sometimes unjustly. So when this bug started rearing its ugly head, I just assumed (apparently not learning from my mistakes) that ORACLE had something to do with it. I extracted the ORACLE part of the /auth/login method into the new /auth/testOracle method, and when that failed too, I thought I'd found the problem.

If you've read this blog for a while, you might see where this is going. You're right, and you're also wrong, just as I was.

You're right because I was making bad assumptions. Those assumptions caused me to waste two weeks blaming the problem on the ORACLE team. When we have problems like this, often it actually is ORACLE causing the problem, but not always. I was succumbing to lazy thinking by not investigating sooner than I did.

You are wrong (and, to be fair, so was I) because the actual problem was not something we could have predicted.

Tracking the Monster

My coworker Nate and I sat down early last week to try to figure this out once and for all. The bug appeared to happen only in our dev environment, not locally, so we would make incremental changes to the API, deploy them via our new CI/CD process, hit a known-working call to confirm the deployment and force the app to compile, and then hit the known-bad method. We did this over and over, for an entire morning.

Normally, when we hit this app, it has to compile before it can respond to requests (as do many ASP.NET applications). That means there's a delay of about 5-10 seconds before we get a response to the first request, and no such delay for any subsequent requests.
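
We did all of this by hand in a browser, but the check itself is simple enough to express in code. Here's a rough sketch of the idea using HttpClient and a Stopwatch; the base address and the /equipment action name are placeholders:

// Sketch of the manual check we kept repeating: time the first request after a
// deployment. A healthy cold start pays a compilation delay of several seconds;
// the broken routes came back with a 500 almost instantly.
// The base address and the /equipment action name are placeholders.
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class ColdStartCheck
{
    static async Task Main()
    {
        using (var client = new HttpClient { BaseAddress = new Uri("http://dev-server/api/") })
        {
            foreach (var route in new[] { "equipment/list", "auth/testOracle" })
            {
                var timer = Stopwatch.StartNew();
                var response = await client.GetAsync(route);
                timer.Stop();

                Console.WriteLine($"{route}: {(int)response.StatusCode} in {timer.ElapsedMilliseconds} ms");
            }
        }
    }
}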

In my frustration, I made a deployment to the dev server and immediately tried to hit the known-bad method in my browser. The route to that method looks like this:

  • /auth/testOracle

While I was turned away from my monitor in disgust, Nate noticed that the method had responded immediately. There had been no compilation delay, which should not happen, but that wasn't even the worst of it. On a whim, we tried several more method calls in the same route:

  • /auth/login
  • /auth/getuserinfo
  • /auth/changepassword

Each of them had the same problem: they would respond immediately after a new deployment, with no compilation delay. Something very strange was going on here.

We conferred with the server admin group, and they swore up and down that there was nothing wrong with the configuration on this server. We inspected the network traffic, checked error logs, and found nothing. This thing completely stumped us.

As is normal when we go bug-hunting, we immediately tried to determine the commonality between all these failing methods. The only thing we could come up with was the route prefix /auth, which we changed to:

  • /authentication/login
  • /authentication/getuserinfo
  • /authentication/changepassword
  • /authentication/testOracle

And, to our considerable surprise, it worked! All our methods would now execute properly.
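
The workaround itself was nothing more than swapping the prefix on the controller shown earlier; roughly:

// The workaround, roughly: only the route prefix changes; the action routes stay the same.
using System.Web.Http;

[RoutePrefix("authentication")]   // was "auth"
public class AuthController : ApiController
{
    [HttpPost]
    [Route("login")]              // now reachable at /authentication/login
    public IHttpActionResult Login(LoginUserRequest request)
    {
        //...
    }

    // ...other actions unchanged...
}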

But WTF? Why would this only happen on the dev server, not locally? I figured it must be that "auth" was some kind of keyword on this server, and IIS was automatically redirecting any requests beginning with "/auth" to another location it couldn't find.

So, change "auth" to "authentication" and all's well, right? Not quite.

We had a lot of route prefixes in this app. Here's just a few of them:

  • /auth
  • /messages
  • /equipment
  • /photos
  • /userdata

All of them worked, except for /photos. The calls to the methods in that route had the exact same issue as the calls to the /auth route; they responded immediately with a 500 error.

A Plea for Help

So now we have two failing routes, /auth and /photos, and no idea how or why they are failing. Changing them to /authentication and /pictures makes them work, but the problem is still basically magic and I am loath to leave it alone.

So, I'm asking for help from you, dear readers. Has anyone seen an issue like this before? Do you have any links or documents we could read to figure out why /auth and /photos, of all our 20+ route prefixes, are failing? This issue won't prevent us from finishing this project, but we'd truly be grateful for any assistance.

If you have any ideas, even ones you think we might have already tried, let us know in the comments.

Thanks, and Happy Coding!

"I Don't Trust Anything That We Didn't Build"

The problems started small, as they often do. But as we've seen many times before, lots of small problems in quick succession tend to make one big problem.

In this case, the problem got big fast. It started off easy enough: read the bug report, find the bug, fix it, the usual. Our bug-tracking team located the source of the issue right away, and my team set about trying to work out the fix. We found the data source that was causing the issue, and it happened to be a web service owned by another team. We couldn't see into it; we could only see the inputs and outputs. It was essentially a black box to us.

Here's where the bad part starts. Due to a lack of efficient communication on all sides, impending deadlines, frustrated coders and general hesitancy to deal with this particular, ahem, time-intensive project, the actual bug fix took about four days to nail down. Yes, four days; we were just as annoyed as you are, and probably more so.

To make a long story short, the project we were attempting to deal with was:

  • old,
  • slow,
  • in desperate need of a rewrite,
  • using a data source which we had no visibility into (the aforementioned service),
  • not written by anyone currently on the team,
  • still our responsibility to fix AND
  • needed to be fixed right friggin now.

You should read that list and cringe a little for each bullet point. I know I did.

All of those problems put together (plus the fact that it took us four days to figure it out) prompted my manager, normally a well-reasoned, thoughtful individual, to say during our bug post-mortem:

"I'm starting to really loathe this project. It's getting to the point where I don't trust anything that we didn't build."

I have to say, it's hard to blame him for wondering if we shouldn't be using things that were not invented here.

It's incredibly easy for a software development team, even an experienced one like mine, to fall into the comfortable trap of believing that everybody else's code is terrible and their own is awesome. We developers often forget (or, quite possibly, want to forget) that most of the time the bug is in our code, no matter how much we wish that it wasn't.

Do this enough and the untamed wild of other people's code starts to look like an intimidating place. It's safer, easier, to believe that your code is correct and everyone else's is wrong, because that means you don't have to spend time trying to understand what the other person was thinking. Or, just as often, spend time figuring out how you are wrong, something nobody enjoys doing.

I've written before that I believe code should exist for a reason. The difficulty in working with other people's code is that not only are you trying to understand what the code does, you're trying to comprehend the reason why it does that. That's a difficult thing to accomplish in the best of times (efficient communication being a feat that usually fails, except by accident), and when you're approaching a deadline and trying to have a meaningful conversation with the original developer who has his own deadlines and responsibilities to deal with, it can be nigh impossible.

Let me be perfectly honest: there are times I completely understand my manager's frustration. It would be SO much easier if the only code I had to deal with was my own, because then the only stupid person in the equation is me and I can fix that. Dealing with other stupid people is infinitely more frustrating than dealing with your own stupidity.

To be clear, I am not calling my coworkers stupid; they are in fact quite the opposite. But it's tempting to fall back to lazy thinking and believe they are stupid merely because they were dealing with requirements and scenarios that I didn't have time to thoroughly understand. That temptation, to believe that things are stupid because I don't understand them, is something I find myself fighting against on a daily basis. It's an innate human quality, and not unique to programmers or other technical people.

Here is a basic fact of life: people, on the whole, are not stupid. Programmers do not write code for no reason, as the best code is no code at all and if we could have our way there would be no code, ever. But because code needs a reason to exist, it almost certainly had a set of requirements, or scenarios, or something which shaped its current form. Even if those requirements were merely thoughts in the original developer's head, they existed. It is not the fault of that developer that some idiot who saunters up to a laptop and is trying to break her code doesn't understand what said code is meant to do.

But it's easy to think that, isn't it? It's easy, it's simple, it's lazy. When we don't have time or energy to think, really think, the lazy thoughts are what we are left with. Given that programming is an almost-entirely-mental task, accepting the lazy thoughts as fact could even be seen as a reprieve from needing to think critically all day, every day.

Resist the lazy thoughts. Resist the idea that your fellow programmers are stupid, or wrong, or only doing a half-done job. Resist Not Invented Here syndrome. Resist the idea that because someone didn't understand you, they're dumb. Resist all these little thoughts that end up with a conclusion of "those other people are stupid," and instead try to answer "what were they trying to accomplish?" There's nothing wrong with digging a little deeper for a better understanding.

That's what I say to you: resist the lazy thoughts, and dig a little deeper. You will eventually have to trust something you didn't build. If you keep digging, you'll find what you are looking for.

Post image is Digging a hole for the digester from Wikimedia Commons, used under license.

How Do You Fix An Impossible Bug?

Within the span of an hour, it had all gone to hell.

The first deployment went rather smoothly. It was a fix to an existing web service, and went out with no problems, or so we thought. Within ten minutes of the deployment, the users started complaining of a minor bug, one that was seemingly omnipresent but didn't really stop them from doing meaningful work. The team which had sent out the deployment immediately set to work figuring out what was going on.

Unrelated to that deployment and forty minutes later, my team was launching a major change to a web site that was consuming that other team's web service. When our change went out, the bug that the users had been complaining about from the web service deployment vanished, replaced by a major bug that caused them to be unable to do any work at all. Naturally we were a little concerned.

The major bug caused an error message that included this statement:

"Unknown server tag 'asp:ScriptManager'."

Now usually when I see an error message like that, I think the project is missing a reference. But no, the references the project needed did, in fact, exist. The second thing I think is that it was using the incorrect version of .NET (in this case, 2.0). Wrong again, it's using the proper version. So now I'm a bit stumped; I pull one of my developers off his project to work on this, and he and I go to our manager and the web service team to try to hash this out.

It took the five of us about an hour to work out where exactly the problems were. As so often happens with major problems like this, it wasn't one problem but several, all intersecting at one time. They snowballed like so:

  • The previous day, my group attempted to roll out the major change to the web site. The rollout didn't work, and another manager (who had previously owned the project) immediately, and unbeknownst to me, copied the application to a different location on the same server. He figured this would solve the problem, as it had before with a different app; it didn't.
  • Before the web service change went out, the users had already been notified to use the new location. Consequently they started complaining about the major error.
  • When the web service change was deployed, a different set of users within the same group complained about the minor bug, as word had not reached them to use the new location.
  • When our web site change went out (to the original location), the users of that site noticed and now complained of a "new" bug at the old location, despite the fact that it was the same bug as the bug at the new location.
  • All of this taken together means that the relationship between our web site and their web service was pure coincidence. The fact that the two deployments went out so near to each other had nothing to do with the actual problem. Coincidences like this are the worst possible thing that can happen when trying to find a bug.

Got all that?

Ultimately we worked out the problem. Well, really, we stumbled onto it. How we got there was such blind luck that I'm not convinced we actually solved the problem so much as lucked into a solution.

A bit of googling for the error message revealed this StackOverflow answer, which states that the reason we get the above error is that a piece of the web.config is missing. Here's what it should be:

<pages>  
    ...
    <controls>
        <add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=3.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
    </controls>
</pages>  

We had previously confirmed that that line did, in fact, exist in our application's configuration. Obviously this could not be the problem. (I apparently forgot about what happened the last time I assumed something was true.) Later, when we started getting more desperate to find the source of the problem, I had our server team give me a copy of the app running on the production servers. This is what I found:

<pages>  
    ...
    <controls>
        <!--<add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=3.5.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>-->
    </controls>
</pages>  

Yes, you read that right. For some damn reason, the offending line of configuration code had been commented out.

And we had no idea how this happened.

We tried examining the config transforms; there was nothing that would result in that line being commented out. We tried looking at the server logs, the build logs, the source control history, anything that might give us a scrap of information as to how the hell this happened. We found nothing.

As you might imagine, this was a little frightening. This line got commented out, and we couldn't reproduce how. How can you fix a bug that should have never occurred in the first place? When we deployed the corrected configuration file, it worked, of course. But in the meantime we had wasted nearly an entire day looking for something that should have been impossible.

But was it impossible, or did we miss something? I'm inclined to believe the latter. One of the things I stress to my team when they come to me with bug reports is the important question of what changed? What changed between when the system worked and when it didn't? Was it business rules, data sources, the build process? If we can determine what changed, the time needed to pinpoint the actual bug shrinks dramatically. In this case, either we simply didn't know what had changed (the most likely scenario) or nothing had changed (the far scarier scenario). Either way, something was off and we couldn't determine what it was.

What was even more worrisome was that there had been a minor bug reported before the major bug showed up, the one that was annoying but not work-stopping. That minor bug was not reproducible now, so it's entirely possible it's still lurking out there, waiting for some unsuspecting user to click the wrong button or enter the wrong date and then blow up the whole system. We don't think it will be that serious, but we also don't really know that it won't. We're stuck in bug-induced purgatory.

That's a terrible feeling. Something went wrong, and you can't figure out how. You know what the solution is, but you don't know why.

I suppose I should just be happy we figured out the problem. I am, sort of. And yet I am forced to conclude that there exists a bug which caused a critical part of our application configuration to be commented out for no discernible reason. A part of me still wonders: how can I find and fix that bug, a bug that should have never existed in the first place?

And what if it happens again?
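
If it does, the best defense I can think of right now is a cheap post-deployment sanity check, so we'd at least know immediately. Here's a rough sketch of one (we don't actually run this; the config path is a placeholder):

// Sketch of a post-deployment sanity check: verify that the ScriptManager
// registration survived the deployment. A commented-out element simply
// disappears from the parsed XML, so a presence check is enough to catch it.
// The config path is a placeholder.
using System;
using System.Linq;
using System.Xml.Linq;

class ConfigSanityCheck
{
    static void Main()
    {
        var config = XDocument.Load(@"\\prod-server\ourapp\web.config");

        bool registrationPresent = config
            .Descendants("controls")
            .Elements("add")
            .Any(e => (string)e.Attribute("tagPrefix") == "asp"
                   && ((string)e.Attribute("assembly") ?? "").StartsWith("System.Web.Extensions"));

        Console.WriteLine(registrationPresent
            ? "web.config: ScriptManager registration present."
            : "web.config: ScriptManager registration MISSING (or commented out).");
    }
}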

What about you, dear readers? What are some bugs you found that you couldn't source? How did you fix these "impossible" bugs? Share in the comments!

Happy Debugging!

Post image is 2008 broken computer monitor from Wikimedia, used under license.