Have you ever been hunting a bug, absolutely sure that it was in someone else's code, only to find out that, nope, it was in yours all along? I sure have. Come along with me as we explore my latest minor failure and remind ourselves that, most of the time, the bug is in your code.
Emperor Gum Moth from Wikimedia, used under license
Well, hopefully that bug isn't in your code, but you never know.
Bad Assumptions, Bad Actions
We inherited a new application, a chat system built using SignalR and .NET 4.5, which we needed to deploy to a brand-new IIS 8 server box. The system consists of two pieces: a hub, which acts as the "server" for the broadcast system, and the client web application, which receives those broadcasts. By the time we inherited the system, the hub had already been deployed; we just needed to get the client app onto the correct server and we'd be good to go.
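For those who haven't used SignalR: a hub is just a class whose public methods the connected web clients can invoke, and which can broadcast back out to every client that's listening. A minimal sketch of that shape looks something like this (the class and method names here are hypothetical, not our actual code):

using Microsoft.AspNet.SignalR;

// Minimal SignalR 2.x hub sketch; "ChatHub", "SendMessage", and "receiveMessage"
// are illustrative names only.
public class ChatHub : Hub
{
    public void SendMessage(int conversationId, string text)
    {
        // Broadcast the message to every connected web client.
        Clients.All.receiveMessage(conversationId, text);
    }
}

Our client web app kept a connection open to the hub and invoked hub methods like this one whenever a user sent a message.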
The deployment itself went off without a hitch; the web app was accessible to the right people, let us log in, and basically worked. Or so it seemed.
As part of sanity-checking the deployment, we programmers logged in and tried to create conversations and send messages through the chat system. We could create a conversation and send its first message just fine, but after that, no further messages went through. A quick debugging session showed that the SignalR piece on the web client was failing at some point while trying to reach the hub.
We checked and double-checked the web client settings and couldn't find anything that might cause this error. That's when we went to the server team, pointed out our problem, and asked if they could help us figure out what exactly was going on. Clearly this was a server or connection issue; the app had been deployed onto a clean IIS 8 machine which might not have gotten the correct settings applied, and SignalR was failing to connect. That means network problems, right?
Except the server team couldn't find anything wrong. They did whatever server magic they do (I'm SO not a server guy) and didn't detect any problems. Yet our debugging suggested that the error was in the network, possibly permissions- or access-related. For a few days we went back and forth over this, arguing and haggling and posturing until we were bloody tired of even talking to the other side.
I went back to debug this yet again, still flustered: I was convinced this was a network problem, but the server team couldn't find any explanation for the error we were getting. Come on, server guys, it's obviously some kind of network issue, and you can't even find it?! This can't be my fault! Bewildered (and kinda angry) at their lack of insight, I stepped through the web client code once more.
It CAN Be Your Fault
It was during this bug-hunting session that I decided to re-read the actual text of the connection error message I was seeing:
SignalR: chathub.[MethodName] failed to execute. Error: the underlying provider failed on Open.
Hmmm, I thought, that error only appears after initiating a hub request, and it takes about a minute to show up. So it clearly is hitting the hub; it's just not getting any response back. Now that I'm thinking about it, I've often seen that "underlying provider failed on Open" error when something fails to connect to a database... Crap.
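For anyone who hasn't run into it: "The underlying provider failed on Open" is an Entity Framework message, not a SignalR one; EF throws it when it cannot open the underlying SQL connection. So a hub method shaped roughly like the hypothetical sketch below will surface exactly this error to the SignalR client if its connection string points at a database server it can't reach (again, the type and connection string names are assumptions for illustration, not the app's real ones):

using System.Data.Entity;   // Entity Framework 6
using Microsoft.AspNet.SignalR;

// Hypothetical EF context; "ChatContext" and the "ChatDb" connection string
// name are placeholders.
public class ChatContext : DbContext
{
    public ChatContext() : base("name=ChatDb") { }
    public DbSet<Message> Messages { get; set; }
}

public class Message
{
    public int Id { get; set; }
    public int ConversationId { get; set; }
    public string Text { get; set; }
}

public class ChatHub : Hub
{
    public void SendMessage(int conversationId, string text)
    {
        using (var db = new ChatContext())
        {
            db.Messages.Add(new Message { ConversationId = conversationId, Text = text });

            // SaveChanges is where EF actually opens the SQL connection. If the
            // "ChatDb" connection string points at a server this box can't reach,
            // the connection attempt eventually times out, EF throws
            // "The underlying provider failed on Open.", and SignalR reports the
            // hub method as having failed to execute.
            db.SaveChanges();

            Clients.All.receiveMessage(conversationId, text);
        }
    }
}

That would also explain the delay I was seeing: the hub was waiting on a connection attempt that was never going to succeed.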
It was like a lightbulb came on in my head.
I quickly started rummaging through the hub settings (remember, the hub had been deployed before we took ownership of this application). Sure enough, the profile of the hub that had been deployed to production was using a connection string that pointed it at our staging database server, not the production one. With the wrong connection string, the hub couldn't connect to the database at all (in our company, production servers cannot contact staging ones, and vice versa), hence the error. We could create conversations because that happened in the web app, which had the correct production connection string, but sending messages after the conversation was created required the hub to work.
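The fix was simply to point the deployed hub's connection string at the production database server instead of staging. In Web.config terms, something like the hypothetical entry below, where the server, database, and key names are placeholders rather than our real values:

<connectionStrings>
  <!-- The deployed hub's connection string pointed at our staging SQL server;
       the fix was to point it at production instead. All names here are
       placeholders for illustration. -->
  <add name="ChatDb"
       connectionString="Data Source=PROD-SQL01;Initial Catalog=Chat;Integrated Security=True"
       providerName="System.Data.SqlClient" />
</connectionStrings>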
All of this, of course, meant that the error was not the server team's fault. It was mine: my team (with me as the lead developer) was responsible for this app now. I didn't deploy the hub, and I didn't write it, but the bug was my responsibility, and I hadn't researched it thoroughly enough.
This is what happens when we assume the solution and don't take the time to do our own thorough research. I was so certain that the hub had been deployed properly and that our settings were right that the only remaining possibility, to my mind, was the network. Having reached that erroneous conclusion, I became blind to the possibility that the bug was still in my code.
Of course, it was in my code, and of course, I blamed other people for what was ultimately my responsibility. Not exactly my finest moment, I'll admit.
This is yet more proof that we're all (especially me) not as smart as we think we are. Most of the time, the bug is in our code. Not in a library, not in some other person's service, not in the network, not "somewhere else". In our code.
Jeff Atwood put this another way: he wrote that the first rule of programming is that it is always your fault. If you assume that the problem exists in code you are responsible for, you are more likely to actually locate the bug (because most of the time, it genuinely is there). Even if you don't locate the bug, you've learned enough about the system in the process that you should have a much better idea of where it might exist.
Assume that the bug is in your code! Only after you've proven that it isn't can you start to explore other possibilities.
We've all now seen how my failure to thoroughly check my own code ended up making me look like an idiot. Has this ever happened to you, or even to some "friend" that you know (wink wink)? Share in the comments!