Matthew Jones's Picture

Matthew Jones

I'm an ASP.NET and Microsoft-stack developer who loves teaching and programming.

Solving the Stupid Bug

A few weeks back I wrote about a very unusual bug that returned 500 errors for routes that should have been valid. Recently my team and I, to our great relief, discovered the cause of this bug. Of course, that wasn't before we ended up yelling "WTF?!" a great many times.

A set of letter cubes from the game Boggle!, spelling out

A great many times.

A New Failing Route

Most of the our collective WTFs occurred when we discovered the original problem. As I wrote about in the earlier post, we have an ASP.NET Web API application in which the routes /auth and /photos had been returning 500 errors whenever we tried to hit them. The odd thing was, the 500 errors always got returned immediately, even when there should have been a compilation delay. (Go read the whole post if you want the complete backstory).

We implemented a temporary fix by changing the routes for /auth and /photos to /authentication and /pictures, but that obviously is not a good long-term solution for something that should never have happened in the first place. (Not that I have a great track record with things that should have never happened in the first place, but hey, I'm an optimist.)

And then my coworker Diya noticed something that stunned us: now a new route was failing. The route /phones was also giving us the dreaded no-compilation-delay 500 errors.

Diya spent a day proving that /phones had the same issue that the other routes did, just like Nate and I had done a few weeks ago. And, sure enough, she came to the same conclusion: the route never responded for any request made to it.

But this didn't make sense to me. How could a route, which had been previously working, suddenly starting returning 500 errors? WTF?

Determining the Root Cause

I'd had just about enough of this. If our API's routes were going to randomly start failing for no damn reason, I needed to know why. I pestered our server team (which, admittedly, usually has bigger fish to fry than tracking down some fiddly route problem on a dev server) until they took a look at the server which hosted our API.

When they did finally peruse through the server, the told me that several other web applications existed there. They also sent me this screenshot:

A screenshot of the IIS application our API was running in.

In case this wasn't clear enough, let me make the exact problem a bit more obvious:

The same screenshot from IIS, with the offending three apps highlighted.

The server which hosted our API had empty web apps on it that corresponded exactly to the failing routes! So, whenever we tried to hit /auth or /phones or /photos, IIS would redirect that request to the corresponding app, and because that app wouldn't actually work, we'd get back 500 errors.

Hooray! We'd found the bug, and now we could finally kill it! Except...

Here's where the real fun begins. From our perspective, the route /phones (the route that Diya was working on) started failing after /auth and /photos (the routes that Nate and I discovered didn't work a few weeks ago). That, to me, suggests that the Phones application was created at a later time. But the server team swears up and down that all three apps were created at the same time.

How Did This Happen?

So now there are three possibilities. Either:

  1. The Phones web app was in fact created later, even though there's no record of that happening OR
  2. Something in our Continuous Integration/Continuous Deployment process is automatically creating these apps (which is extraordinarily unlikely) OR
  3. We never ran any code against the /phones route in the three weeks of testing we've been doing (or, if we did, didn't notice that it failed).

Occam's Razor suggests that the answer is (3). I personally don't see how that's possible with the amount of testing we've been doing over the last several weeks, but (1) and (2) are so much more unlikely that it's almost ludicrous to decide that either was the culprit. Logic suggests that we simply didn't test the /phones route, and in other news, I now hate logic.

Summary

We finally figured out why our routes were failing: there were empty applications on the server that matched the route names. Since those applications have been removed, we've been able to reset the routes to the names we wanted in the first place (/auth, /photos, and /phones). But now we're left with a worrying possibility: was an entire route (/phones) failing for three weeks and we didn't notice?

But, hey, at least it is fixed. That's good, I think. Right?

Happy Coding!

Why Does ODP.NET Have Async Methods If They're Not Asynchronous?

The Oracle Client (ODP.NET) that we are using to build a major web project put us in a really stupid bind last week, ultimately causing me to waste two days in search of something that wasn't really going to work in the first place.

I'm on something of an async/await binge with this project, since in C# and ASP.NET it's really easy to implement and possibly gives you great scalability benefits. So, one of the first things I did was to make all our ORACLE accessing code asynchronous. There's a particular section of the code which needs to call a bunch of stored procedures in a row, and I figured that since the procedures don't really rely on each other, we'd benefit quite a bit by making them asynchronous.

All of this was going swimmingly, until this error started popping up:

"Connection must be open for this operation"

Not being a total expert on this, I would still expect that, given an asynchronous execution, each async call would have its own connection and so this error should never happen. And yet, there it was, repeatedly being logged into our exception logger.

So, to the Google I went, and eventually came across this StackOverflow post saying that the Oracle Managed Driver does not support asynchronous operations.

Yes it does! I thought. It's got all these async methods! They wouldn't name them "async" unless they were actually asynchronous! (Apparently I've never been on the internet before.)

In fact, the Oracle Data Provider for .NET does contain a few asynchronous methods:

  • ExecuteNonQueryAsync()
  • ExecuteReaderAsync()
  • ExecuteScalarAsync()

But the StackOverflow post says that these methods are not actually async, they are synchronous. So what's going on?

Browsing the features list for ODP.NET, I can't find any reference to asynchronous methods, the Task or Task<T> classes, or even the await keyword. In fact, I can't find any documentation at all that says these async methods need to exist, and yet they do. The question is: why?

Microsoft's own documentation says the following:

"In Visual Studio 2012 and the .NET Framework 4.5, any method that is attributed with the async keyword (Async in Visual Basic) is considered an asynchronous method..."

But it never actually specifies that the method has to actually be asynchronous! Rather, only that the method should return a Task or Task<T>, and even that is only a convention!

All the above is to say: technically, ORACLE did nothing wrong by suffixing these method names with "async", since in fact they do return a Task or a Task<T>. In fact, they might even be useful since they can be asynchronous with other, non-ORACLE-accessing methods. But that wasn't my scenario, and how was I supposed to know about this in advance?

The question becomes: since the methods themselves are not actually executed asynchronously, are they of any use?

Short answer? Nope. At least not for my situation. If anything, it's more harmful than helpful.

As far as I can tell, the only reason the asynchronous versions of these methods exist is to conform to some mythical interface mentioned in the comments of this StackOverflow answer. Which just begs the question: why does the interface need these method declarations? Ultimately I decided to get out of the rabbit hole before digging myself too deep and remove all my asynchronous code that used ODP.NET.

So now, in addition to the two days I spent researching this and implementing it for ORACLE calls, I now have to spend at least half a day removing it. All because ORACLE made some "asynchronous" methods that aren't actually asynchronous. Thanks ORACLE!

To anyone else out there who has to use ODP.NET: don't bother with the asynchronous methods unless you really know what you are doing. They're a lot more trouble than they are worth.

And, if you do really know what you are doing, tell me in the comments what situations these bloody things might actually be useful!

Happy Coding!

Extending RestSharp to Handle Timeouts in ASP.NET MVC

We're working on a large project which consumes a custom-build API, and one of the requirements of this project is that if the API goes down, certain sections of the website still need to be able to function. The reason for this is that our app consumes a lot of APIs and services, and occasionally a non-critical one will go down (for maintenance, bugs, whatever) and we don't want our site to crash as well, since this was a problem that plagued the previous version of this app.

To do this, we needed a way to ensure that if a non-critical service or API went down, we could record the fact that a request to said service timed out and NOT throw up a scary-looking error page to our users. We're using the incredible RestSharp library to consume our API, so it occurred to me that the best place for this timeout-handler to exist was in some kind of common class that extended RestSharp's basic functionality. This post lays out what we built in order to record timeouts from our APIs and services (and, as always, a demo application can be found over on GitHub).

API Setup

I've set up the sample project in much the same way that our real project is set up. Closest to the data store (e.g. the "bottom" layer of the stack) is the API, which will create data for our ASP.NET MVC application to consume.

For this demo, let's create a simple API which returns the current date and time.

[RoutePrefix("home")]
public class HomeController : ApiController  
{
    [HttpGet]
    [Route("date")]
    public IHttpActionResult GetDate()
    {
        var rand = new Random();
        Thread.Sleep(rand.Next(1, 100));
        return Ok(DateTime.Now);
    }
}

The call to Thread.Sleep() exists to show that we can handle situations where the API doesn't respond. Any call to this API method will be delayed between 1 and 100 milliseconds before responding. (Just to be clear, don't use that call in production applications.)

Building the Base Client

In order for our ASP.NET MVC app to consume our API, we build classes known as clients, which implement an interface from RestSharp called IRestClient. Normal usage of RestSharp dictates that we create a class which implements that IRestClient interface.

But we have a different scenario than is covered by the normal usage. We need to determine a way to ensure that our clients can handle timeouts, and in a uniform manner. To accomplish this, let's implement a base client class from which all our other clients will derive.

public class ClientBase : RestSharp.RestClient  
{
    public ClientBase(string baseUrl)
    {
        BaseUrl = new Uri(baseUrl);
    }
}

Notice that the base class inherits from RestClient rather than implementing IRestClient.

This base client will need to override the existing Execute() methods in RestClient to have them call our Timeout checking method. Here's the four methods we want to implement (as well as the implementation of the Timeout check):

public class ClientBase : RestSharp.RestClient  
{
    public ClientBase(string baseUrl)
    {
        BaseUrl = new Uri(baseUrl);
    }

    public override IRestResponse<T> Execute<T>(IRestRequest request)
    {
        var response = base.Execute<T>(request);
        TimeoutCheck(request, response);
        return response;
    }

    public override async Task<IRestResponse> ExecuteTaskAsync(IRestRequest request)
    {
        var response = await base.ExecuteTaskAsync(request);
        TimeoutCheck(request, response);
        return response;
    }

    public override async Task<IRestResponse<T>> ExecuteTaskAsync<T>(IRestRequest request)
    {
        var response = await base.ExecuteTaskAsync<T>(request);
        TimeoutCheck(request, response);
        return response;
    }

    private void TimeoutCheck(IRestRequest request, IRestResponse response)
    {
        if (response.StatusCode == 0)
        {
            //Uncomment the line below to throw a real exception.
            //throw new TimeoutException("The request timed out!");
        }
    }
}

We overrode these particular methods to handle four scenarios:

  1. Execute() handles situations in which we don't expect response data (meaning we will get a IRestResponse object with things like the HTTP status code, but we won't get any further data out of it).
  2. Execute<T>() handles situations in which we DO expect response data from the API.
  3. ExecuteTaskAsync() handles scenarios in which we want to asynchronously call the API and don't expect any response data.
  4. ExecuteTaskAsync<T>() handles the final scenario, in which we want to asynchronously call the API and expect to get response data back.

I want to draw particular attention to the TimeoutCheck() method. As far as I can tell, the way RestSharp denotes a timeout is to give the generated Response object a StatusCode of 0. This seems like it could also represent other scenarios, and so if anyone out there has a better way for me to specifically identify timeouts, I'd love to hear it.

Anyway, with the base class now implemented, we can build out client.

Building the Clients

We need to build a specific Client which will consume the API method (GetDate()) that we defined earlier. This class will need to both inherit from ClientBase (to get the custom method we wrote in the previous sections) and implement IRestClient (to get the additional functionality provided by RestSharp). For completeness's sake, we will also implement asynchronous and synchronous versions of the same call.

Here's the complete DateClient class:

public class DateClient : ClientBase, IRestClient  
{
    public DateClient() : base(Constants.BaseApiUrl) { }

    public DateTime? GetDate()
    {
        var request = new RestRequest("/home/date");
        request.Timeout = 50;
        return Execute<DateTime?>(request).Data;
    }

    public async Task<DateTime?> GetDateAsync()
    {
        var request = new RestRequest("/home/date");
        request.Timeout = 50;
        var result = await ExecuteTaskAsync<DateTime?>(request);
        return result.Data;
    }
}

Notice that the request.Timeout property is set to 50 milliseconds. Remember that our API will delay for a random amount of milliseconds between 1 and 100. This setup ensures that we can test our ability to handle timeouts. In the real world, you would set the Timeout property to something considerably more sane than 0.05 seconds.

More importantly, note that these methods return a nullable DateTime?, whereas the API method returns the non-nullable DateTime. This is because if RestSharp does not get a response, the IRestResponse.Data property will be set to the default value of the object it is trying to deserialize (e.g. the T in Execute<T>()). Because we set it to nullable, we automatically know that a null value is the error case, and can handle it accordingly.

Completing the Demo

We only need a couple more pieces to complete the demo. First, let's build an MVC controller which implements both the synchronous and asynchronous versions of our Client methods:

[RoutePrefix("home")]
public class HomeController : Controller  
{
    [HttpGet]
    [Route("index")]
    public ActionResult Index()
    {
        var client = new DateClient();
        var model = new HomeIndexVM();
        model.CurrentDate = client.GetDate();
        return View(model);
    }

    [HttpGet]
    [Route("async")]
    public async Task<ActionResult> Async()
    {
        var client = new DateClient();
        var model = new HomeIndexVM();
        model.CurrentDate = await client.GetDateAsync();
        return View("Index", model);
    }
}

We also need to build a simple Index.cshtml view to display the current date and time:

@model RestSharpTimeoutDemo.MVC.ViewModels.Home.HomeIndexVM

@{
    ViewBag.Title = "Home Page";
}

@if (Model.CurrentDate.HasValue)
{
    <h1>The current date/time is @Model.CurrentDate</h1>
}
else  
{
    <h1>The request timed out!</h1>
}

Finally, we can run the app. Sometimes we will get the error version of this page:

But, when the call doesn't timeout, we will correctly display the date and time.

The important thing is that either way, the user is not taken to some unhelpful error page (or, God forbid, the Yellow Screen of Death), and instead remains in the app where s/he can actually get some work done and not be blocked because we couldn't display the current date and time. If our goal was to get out of the user's way, I daresay this achieves that feat.

Summary

RestSharp is wonderfully extensible, and we took advantage of that to create a common client that handles timeouts. This client allows us to step out of our user's way to provide a more seamless user experience by, essentially, not throwing up a big yellow flag which screams SOMETHING WENT WRONG. And that, I feel, is a win.

If you want to dive deeper into this problem (or see something you could improve), check out the sample project on GitHub!

Happy Coding!

Post image originally by @raysato and modified by me.

A Stupid Bug and a Plea for Help

UPDATE (17 Apr 2017): We've fixed the stupid bug described in this post! Check it out!

Bug hunts can take on many forms. Some are simple cleanup jobs, some take hours and end up being something stupid, and some look rather easy and turn out to be impossible. My group recently encountered a fourth kind, a bug where the solution is found but makes no sense, and we could use some help figuring out why this bug exists.

Birth of a (Stupid) Bug

Let me set the problem up. I had an ASP.NET Web API controller that looked something like this:

[RoutePrefix("auth")]
public class AuthController : ApiController  
{
    [HttpPost]
    [Route("login")]
    public IHttpActionResult Login(LoginUserRequest request)
    {
        //...
    }

    [HttpGet]
    [Route("testOracle")]
    public IHttpActionResult TestOracle()
    {
        //...
    }
}

Naturally, this results in URLs that look like these:

  • /auth/login
  • /auth/testOracle

The bug was that any time I attempted to execute the /auth/login action, I got a response of 500 Internal Server Error. For those of you who might not work in the web world, that status message is about as helpful as a tent in a hurricane; it's a generic error message used when we don't know what actually happened (which is to say, it's used almost all the time). Worse, it didn't matter what inputs I POSTed, I always got the 500 error.

Now, it's not really a secret around here that I hate ORACLE, though sometimes unjustly. So when this bug started rearing its ugly head, I just assumed (apparently not learning from my mistakes) that ORACLE had something to do with it. I extracted the ORACLE part of the /auth/login method into the new /auth/testOracle method, and when that failed too, I thought I'd found the problem.

If you've read this blog for a while, you might see where this is going. You're right, and you're also wrong, just as I was.

You're right because I was making bad assumptions. Those assumptions caused me to waste two weeks blaming the problem on the ORACLE team. When we have problems like this, often it actually is ORACLE causing the problem, but not always. I was succumbing to lazy thinking by not investigating sooner than I did.

You are wrong (and, to be fair, so was I) because the actual problem was not something we could have predicted.

Tracking the Monster

My coworker Nate and I sat down early last week to try to figure this out once and for all. This bug appeared to only happen on our dev environment, not locally, so we would make incremental changes to the API, deploy that via our new CI/CD process, hit it with a known-working call to test that and make the app compile, and then hit the known-bad method. We did this over and over, for an entire morning.

Normally, when we hit this app, it has to compile before it can respond to requests (as do many ASP.NET applications). That means there's a delay of about 5-10 seconds before we get a response to the first request, and no such delay for any subsequent requests.

In my frustration, I made a deployment to the dev server and immediately tried to hit the known-bad method in my browser. The route to that method looks like this:

  • /auth/testOracle

While I was turned away from my monitor in disgust, Nate noticed that the method had responded immediately. There had been no compilation delay, which should not happen, but that wasn't even the worst of it. On a whim, we tried several more method calls in the same route:

  • /auth/login
  • /auth/getuserinfo
  • /auth/changepassword

Each of them had the same problem: they would respond immediately after a new deployment, with no compilation delay. Something very strange was going on here.

We conferred with the server admin group, and they swore up and down that there was nothing wrong with the configuration on this server. We inspected the network traffic, checked error logs, and found nothing. This thing completely stumped us.

As is normal when we go bug-hunting, we immediately tried to determine the commonality between all these failing methods. The only thing we could come up with was the route prefix /auth, which we changed to:

  • /authentication/login
  • /authentication/getuserinfo
  • /authentication/changepassword
  • /authentication/testOracle

And, to our considerable surprise, it worked! All our methods would now execute properly.

But WTF? Why would this only happen on the dev server, not locally? I figured it must be that "auth" was some kind of keyword on this server, and IIS was automatically redirecting any requests beginning with "/auth" to another location it couldn't find.

So, change "auth" to "authentication" and all's well, right? Not quite.

We had a lot of route prefixes in this app. Here's just a few of them:

  • /auth
  • /messages
  • /equipment
  • /photos
  • /userdata

All of them worked, except for /photos. The calls to the methods in that route had the exact same issue as the calls to the /auth route; they responded immediately with a 500 error.

A Plea for Help

So now we have two failing routes, /auth and /photos, and no idea how or why they are failing. Changing them to /authentication and /pictures makes them work, but the problem is still basically magic and I am loathe to leave it alone.

So, I'm asking for help from you, dear readers. Has anyone seen an issue like this before? Do you have any links or documents we could read to figure out why /auth and /photos, of all our 20+ route prefixes, are failing? This issue won't prevent us from finishing this project, but we'd truly be grateful for any assistance.

If you have any ideas, even ones you think we might have already tried, let us know in the comments.

Thanks, and Happy Coding!