Solving the Stupid Bug

A few weeks back I wrote about a very unusual bug that returned 500 errors for routes that should have been valid. Recently my team and I, to our great relief, discovered the cause of this bug. Of course, that wasn't before we ended up yelling "WTF?!" a great many times.

A set of letter cubes from the game Boggle!, spelling out

A great many times.

A New Failing Route

Most of the our collective WTFs occurred when we discovered the original problem. As I wrote about in the earlier post, we have an ASP.NET Web API application in which the routes /auth and /photos had been returning 500 errors whenever we tried to hit them. The odd thing was, the 500 errors always got returned immediately, even when there should have been a compilation delay. (Go read the whole post if you want the complete backstory).

We implemented a temporary fix by changing the routes for /auth and /photos to /authentication and /pictures, but that obviously is not a good long-term solution for something that should never have happened in the first place. (Not that I have a great track record with things that should have never happened in the first place, but hey, I'm an optimist.)

And then my coworker Diya noticed something that stunned us: now a new route was failing. The route /phones was also giving us the dreaded no-compilation-delay 500 errors.

Diya spent a day proving that /phones had the same issue that the other routes did, just like Nate and I had done a few weeks ago. And, sure enough, she came to the same conclusion: the route never responded for any request made to it.

But this didn't make sense to me. How could a route, which had been previously working, suddenly starting returning 500 errors? WTF?

Determining the Root Cause

I'd had just about enough of this. If our API's routes were going to randomly start failing for no damn reason, I needed to know why. I pestered our server team (which, admittedly, usually has bigger fish to fry than tracking down some fiddly route problem on a dev server) until they took a look at the server which hosted our API.

When they did finally peruse through the server, the told me that several other web applications existed there. They also sent me this screenshot:

A screenshot of the IIS application our API was running in.

In case this wasn't clear enough, let me make the exact problem a bit more obvious:

The same screenshot from IIS, with the offending three apps highlighted.

The server which hosted our API had empty web apps on it that corresponded exactly to the failing routes! So, whenever we tried to hit /auth or /phones or /photos, IIS would redirect that request to the corresponding app, and because that app wouldn't actually work, we'd get back 500 errors.

Hooray! We'd found the bug, and now we could finally kill it! Except...

Here's where the real fun begins. From our perspective, the route /phones (the route that Diya was working on) started failing after /auth and /photos (the routes that Nate and I discovered didn't work a few weeks ago). That, to me, suggests that the Phones application was created at a later time. But the server team swears up and down that all three apps were created at the same time.

How Did This Happen?

So now there are three possibilities. Either:

  1. The Phones web app was in fact created later, even though there's no record of that happening OR
  2. Something in our Continuous Integration/Continuous Deployment process is automatically creating these apps (which is extraordinarily unlikely) OR
  3. We never ran any code against the /phones route in the three weeks of testing we've been doing (or, if we did, didn't notice that it failed).

Occam's Razor suggests that the answer is (3). I personally don't see how that's possible with the amount of testing we've been doing over the last several weeks, but (1) and (2) are so much more unlikely that it's almost ludicrous to decide that either was the culprit. Logic suggests that we simply didn't test the /phones route, and in other news, I now hate logic.


We finally figured out why our routes were failing: there were empty applications on the server that matched the route names. Since those applications have been removed, we've been able to reset the routes to the names we wanted in the first place (/auth, /photos, and /phones). But now we're left with a worrying possibility: was an entire route (/phones) failing for three weeks and we didn't notice?

But, hey, at least it is fixed. That's good, I think. Right?

Happy Coding!

A Stupid Bug and a Plea for Help

UPDATE (17 Apr 2017): We've fixed the stupid bug described in this post! Check it out!

Bug hunts can take on many forms. Some are simple cleanup jobs, some take hours and end up being something stupid, and some look rather easy and turn out to be impossible. My group recently encountered a fourth kind, a bug where the solution is found but makes no sense, and we could use some help figuring out why this bug exists.

Birth of a (Stupid) Bug

Let me set the problem up. I had an ASP.NET Web API controller that looked something like this:

public class AuthController : ApiController  
    public IHttpActionResult Login(LoginUserRequest request)

    public IHttpActionResult TestOracle()

Naturally, this results in URLs that look like these:

  • /auth/login
  • /auth/testOracle

The bug was that any time I attempted to execute the /auth/login action, I got a response of 500 Internal Server Error. For those of you who might not work in the web world, that status message is about as helpful as a tent in a hurricane; it's a generic error message used when we don't know what actually happened (which is to say, it's used almost all the time). Worse, it didn't matter what inputs I POSTed, I always got the 500 error.

Now, it's not really a secret around here that I hate ORACLE, though sometimes unjustly. So when this bug started rearing its ugly head, I just assumed (apparently not learning from my mistakes) that ORACLE had something to do with it. I extracted the ORACLE part of the /auth/login method into the new /auth/testOracle method, and when that failed too, I thought I'd found the problem.

If you've read this blog for a while, you might see where this is going. You're right, and you're also wrong, just as I was.

You're right because I was making bad assumptions. Those assumptions caused me to waste two weeks blaming the problem on the ORACLE team. When we have problems like this, often it actually is ORACLE causing the problem, but not always. I was succumbing to lazy thinking by not investigating sooner than I did.

You are wrong (and, to be fair, so was I) because the actual problem was not something we could have predicted.

Tracking the Monster

My coworker Nate and I sat down early last week to try to figure this out once and for all. This bug appeared to only happen on our dev environment, not locally, so we would make incremental changes to the API, deploy that via our new CI/CD process, hit it with a known-working call to test that and make the app compile, and then hit the known-bad method. We did this over and over, for an entire morning.

Normally, when we hit this app, it has to compile before it can respond to requests (as do many ASP.NET applications). That means there's a delay of about 5-10 seconds before we get a response to the first request, and no such delay for any subsequent requests.

In my frustration, I made a deployment to the dev server and immediately tried to hit the known-bad method in my browser. The route to that method looks like this:

  • /auth/testOracle

While I was turned away from my monitor in disgust, Nate noticed that the method had responded immediately. There had been no compilation delay, which should not happen, but that wasn't even the worst of it. On a whim, we tried several more method calls in the same route:

  • /auth/login
  • /auth/getuserinfo
  • /auth/changepassword

Each of them had the same problem: they would respond immediately after a new deployment, with no compilation delay. Something very strange was going on here.

We conferred with the server admin group, and they swore up and down that there was nothing wrong with the configuration on this server. We inspected the network traffic, checked error logs, and found nothing. This thing completely stumped us.

As is normal when we go bug-hunting, we immediately tried to determine the commonality between all these failing methods. The only thing we could come up with was the route prefix /auth, which we changed to:

  • /authentication/login
  • /authentication/getuserinfo
  • /authentication/changepassword
  • /authentication/testOracle

And, to our considerable surprise, it worked! All our methods would now execute properly.

But WTF? Why would this only happen on the dev server, not locally? I figured it must be that "auth" was some kind of keyword on this server, and IIS was automatically redirecting any requests beginning with "/auth" to another location it couldn't find.

So, change "auth" to "authentication" and all's well, right? Not quite.

We had a lot of route prefixes in this app. Here's just a few of them:

  • /auth
  • /messages
  • /equipment
  • /photos
  • /userdata

All of them worked, except for /photos. The calls to the methods in that route had the exact same issue as the calls to the /auth route; they responded immediately with a 500 error.

A Plea for Help

So now we have two failing routes, /auth and /photos, and no idea how or why they are failing. Changing them to /authentication and /pictures makes them work, but the problem is still basically magic and I am loathe to leave it alone.

So, I'm asking for help from you, dear readers. Has anyone seen an issue like this before? Do you have any links or documents we could read to figure out why /auth and /photos, of all our 20+ route prefixes, are failing? This issue won't prevent us from finishing this project, but we'd truly be grateful for any assistance.

If you have any ideas, even ones you think we might have already tried, let us know in the comments.

Thanks, and Happy Coding!