UPDATE (17 May 2017): We've fixed the stupid bug described in this post! Check it out!
Bug hunts can take on many forms. Some are simple cleanup jobs, some take hours and end up being something stupid, and some look rather easy and turn out to be impossible. My group recently encountered a fourth kind: a bug where we found a solution, but the solution makes no sense, and we could use some help figuring out why this bug exists at all.
Birth of a (Stupid) Bug
Let me set the problem up. I had an ASP.NET Web API controller that looked something like this:
[RoutePrefix("auth")]
public class AuthController : ApiController
{
    [HttpPost]
    [Route("login")]
    public IHttpActionResult Login(LoginUserRequest request)
    {
        //...
    }

    [HttpGet]
    [Route("testOracle")]
    public IHttpActionResult TestOracle()
    {
        //...
    }
}
Naturally, this results in URLs that look like these:
- /auth/login
- /auth/testOracle
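A quick aside for anyone following along at home: attribute routes like [RoutePrefix] and [Route] only take effect if attribute routing is enabled at startup. Ours was; a minimal sketch of that wiring, assuming the standard WebApiConfig template, looks like this:

using System.Web.Http;

public static class WebApiConfig
{
    public static void Register(HttpConfiguration config)
    {
        // Enables [RoutePrefix] and [Route] attributes on controllers
        config.MapHttpAttributeRoutes();
    }
}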
The bug was that any time I attempted to execute the /auth/login action, I got a response of 500 Internal Server Error. For those of you who might not work in the web world, that status message is about as helpful as a tent in a hurricane; it's a generic error used when the server doesn't know what actually happened (which is to say, it's used almost all the time). Worse, it didn't matter what inputs I POSTed; I always got the 500 error.
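One way to make those 500s slightly less useless is to register a global exception logger, so that every unhandled exception at least leaves a trace somewhere. This is a generic sketch, not our actual code:

using System.Diagnostics;
using System.Web.Http.ExceptionHandling;

// Logs every unhandled exception the Web API pipeline sees, so a 500
// response at least leaves something behind to read.
public class TraceExceptionLogger : ExceptionLogger
{
    public override void Log(ExceptionLoggerContext context)
    {
        Trace.TraceError(context.Exception.ToString());
    }
}

// Registered at startup, e.g. in WebApiConfig.Register(config):
// config.Services.Add(typeof(IExceptionLogger), new TraceExceptionLogger());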
Now, it's not really a secret around here that I hate ORACLE, though sometimes unjustly. So when this bug started rearing its ugly head, I just assumed (apparently not learning from my mistakes) that ORACLE had something to do with it. I extracted the ORACLE part of the /auth/login method into the new /auth/testOracle method, and when that failed too, I thought I'd found the problem.
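The body of that extracted method isn't important to the story, but for illustration, a bare connectivity check might look something like this (the managed ODP.NET driver and the connection string here are stand-ins, not our real code):

// Inside AuthController; requires:
// using Oracle.ManagedDataAccess.Client;

[HttpGet]
[Route("testOracle")]
public IHttpActionResult TestOracle()
{
    // If ORACLE were the problem, this should blow up right here
    using (var conn = new OracleConnection("user id=...;password=...;data source=..."))
    {
        conn.Open();
        return Ok("ORACLE says hello");
    }
}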
If you've read this blog for a while, you might see where this is going. You're right, and you're also wrong, just as I was.
You're right because I was making bad assumptions. Those assumptions caused me to waste two weeks blaming the problem on the ORACLE team. When we have problems like this, it often really is ORACLE causing them, but not always; I was succumbing to lazy thinking by not digging into the problem myself sooner.
You are wrong (and, to be fair, so was I) because the actual problem was not something we could have predicted.
Tracking the Monster
My coworker Nate and I sat down early last week to try to figure this out once and for all. The bug appeared to happen only in our dev environment, not locally, so we made incremental changes to the API, deployed each one via our new CI/CD process, hit the app with a known-working call (both to verify the deployment and to force the app to compile), and then hit the known-bad method. We did this over and over, for an entire morning.
Normally, when we hit this app, it has to compile before it can respond to requests (as do many ASP.NET applications). That means there's a delay of about 5-10 seconds before we get a response to the first request, and no such delay for any subsequent requests.
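That delay is easy to observe: time the first request after a deployment against the second. A throwaway console sketch (the server name below is a placeholder):

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class ColdStartCheck
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // First hit after a deploy pays the compilation cost
            var sw = Stopwatch.StartNew();
            await client.GetAsync("http://dev-server/auth/testOracle");
            Console.WriteLine($"First request:  {sw.ElapsedMilliseconds} ms");

            // Subsequent hits should respond with no such delay
            sw.Restart();
            await client.GetAsync("http://dev-server/auth/testOracle");
            Console.WriteLine($"Second request: {sw.ElapsedMilliseconds} ms");
        }
    }
}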
In my frustration, I made a deployment to the dev server and immediately tried to hit the known-bad method in my browser. The route to that method looks like this:
- /auth/testOracle
While I was turned away from my monitor in disgust, Nate noticed that the method had responded immediately. There had been no compilation delay, which should not happen, but that wasn't even the worst of it. On a whim, we tried several more method calls in the same route:
- /auth/login
- /auth/getuserinfo
- /auth/changepassword
Each of them had the same problem: they would respond immediately after a new deployment, with no compilation delay. Something very strange was going on here.
We conferred with the server admin group, and they swore up and down that there was nothing wrong with the configuration on this server. We inspected the network traffic, checked error logs, and found nothing. This thing completely stumped us.
As is normal when we go bug-hunting, we immediately tried to determine the commonality between all these failing methods. The only thing we could come up with was the route prefix /auth, which we changed to:
- /authentication/login
- /authentication/getuserinfo
- /authentication/changepassword
- /authentication/testOracle
And, to our considerable surprise, it worked! All our methods would now execute properly.
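The entire "fix", for the record, was this one attribute change:

// Before: [RoutePrefix("auth")]
[RoutePrefix("authentication")]
public class AuthController : ApiController
{
    // actions unchanged
}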
But WTF? Why would this happen only on the dev server, and not locally? I figured it must be that "auth" was some kind of reserved keyword on this server, and that IIS was automatically redirecting any request beginning with /auth to some other location that didn't exist.
So, change "auth" to "authentication" and all's well, right? Not quite.
We had a lot of route prefixes in this app. Here are just a few of them:
- /auth
- /messages
- /equipment
- /photos
- /userdata
All of them worked, except for /photos. The calls to the methods in that route had the exact same issue as the calls to the /auth route; they responded immediately with a 500 error.
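If you'd like to reproduce our morning of testing without all the browser-refreshing, a quick probe over the prefixes does the job. The server name and the /ping action below are placeholders; point it at any known GET action in each route:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class RouteProbe
{
    static readonly string[] Prefixes =
        { "auth", "messages", "equipment", "photos", "userdata" };

    static async Task Main()
    {
        using (var client = new HttpClient { BaseAddress = new Uri("http://dev-server/") })
        {
            foreach (var prefix in Prefixes)
            {
                // Hit one action under each prefix and report the status code
                var response = await client.GetAsync($"{prefix}/ping");
                Console.WriteLine($"/{prefix}: {(int)response.StatusCode} {response.StatusCode}");
            }
        }
    }
}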
A Plea for Help
So now we have two failing routes, /auth and /photos, and no idea how or why they are failing. Changing them to /authentication and /pictures makes them work, but the problem is still basically magic, and I am loath to leave it alone.
So, I'm asking for help from you, dear readers. Has anyone seen an issue like this before? Do you have any links or documents we could read to figure out why /auth and /photos, of all our 20+ route prefixes, are failing? This issue won't prevent us from finishing this project, but we'd truly be grateful for any assistance.
If you have any ideas, even ones you think we might have already tried, let us know in the comments.
Thanks, and Happy Coding!