When a critical service goes down, what happens to your apps? Do they go down too? Should they? Or should we plan for the service not being available? My team's experiences dealing with a major crash left us asking these questions.
The Crash
Just after lunch this past Thursday, my team started noticing that our TFS server was unresponsive. My boss (let's call him Frank) phoned the server team, who told him that the TFS server had crashed and they didn't know when it would be back up.
We were quite taken aback; such a crash had never happened before. The server team advised us to just wait until they got the server back online, so we patiently waited to hear from them. The Chaos Monkey had chosen us, and now we had to wait to discover the damage it had done.
During that time, we noticed that TFS's crash had also taken down another internal app we were responsible for maintaining (let's call it DSI), one that was semi-important to the internal operations of the company. You see, DSI had a dependency on TFS: we needed to associate DSI's data with source-controlled projects, so we queried TFS's projects service for the list of currently active projects.
It just so happened that our executives needed that app for a meeting that day, and it was up to me and Frank to figure out what we could do to get the app in relative working order before that meeting started.
With only an hour until the meeting began, we started scrambling. Luckily, I had pulled the source down early Thursday morning to do some refactoring, and since I already had the most recent version of the code, I started ripping out the dependency on TFS. After a little while, I got it to the point where the executives could use it for their meeting without it crapping out on them.
We got that version of DSI onto a production server ten minutes before the meeting started. Just a few minutes later we started receiving status emails that told us that the app was working properly, and we both breathed a huge sigh of relief.
The Conversation
Frank, being a manager (and a good one at that), started wondering why we had such a dependency on TFS. Our conversation (over chat) went something like this:
FRANK: Why do we even hit TFS from the DSI tool?
ME: Because [the users] needed the ability to search by TFS project, and I can't get a list of projects without hitting TFS.
FRANK: Makes sense. How do we turn that off for cases like this?
ME: I'm not sure we should be trying to account for cases like this. It wasn't really a thing to say "oh TFS might go down, better plan for that."
FRANK: Bear with me here: if a small piece of information is pulled from an outside source (something that's not important to the application as a whole), and that outside source is down, should it also bring down the entire site?
ME: No, of course not, not in the general case. But in this case it's not reasonable to expect that TFS, which allows all our developers to do any work at all in this company, will go down for any extended length of time.
ME: I would think that would be a disastrous situation, one that is outside the scope of regular planning, because why would I ever plan for [TFS] to go down?
Opposing Views
I don't disagree with Frank's general point here. What he's saying makes objective sense: if one of an app's dependencies suddenly becomes unreachable, why should it take down the entire application? Shouldn't there be some level of redundancy built-in so that one inaccessible dependency doesn't crash a whole bunch of sites?
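To make Frank's point concrete, here is a minimal sketch of one way an app could keep running when a non-critical lookup fails: remember the last successful result and fall back to it. This is not how DSI was actually built. The `ProjectLookup` class and the `fetch_projects` callable are hypothetical names standing in for whatever code queries the projects service, and Python is used purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ProjectLookup:
    """Wraps a remote project query with a last-known-good fallback."""

    # Hypothetical callable standing in for the real projects-service query.
    fetch_projects: Callable[[], List[str]]
    # Most recent successful result, kept in memory only.
    _last_good: Optional[List[str]] = field(default=None, init=False)

    def active_projects(self) -> Tuple[List[str], bool]:
        """Return (projects, is_stale) without raising on a remote failure."""
        try:
            projects = list(self.fetch_projects())
        except Exception:
            # The dependency is unreachable, so degrade instead of failing.
            # A real app would likely catch narrower exception types and log.
            if self._last_good is not None:
                return list(self._last_good), True
            return [], True
        # Success: remember the result for the next outage.
        self._last_good = projects
        return list(projects), False
```

The caller gets a flag saying whether the list is stale, so the UI could hide or grey out the search-by-project feature instead of taking the whole site down. Whether that extra code is worth writing for a dependency you believe will never fail is exactly the question at hand.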
On the surface, I agree with this argument. My only issue is that it seems like a good idea in general but not in this specific case. It requires me to plan for TFS going offline, and IMHO that wasn't a likely occurrence before Thursday and probably still isn't.
TFS is absolutely critical to our company; we are a Microsoft shop through and through, and TFS anchors our development efforts. Because of its central importance, I thought I was making a reasonable assumption that TFS would always be available and would never go down. Why would I plan for something I believed could never happen? Now I'm not sure if this line of thinking was hubris or pragmatism.
I don't want to plan for every possible circumstance, because then I'd never stop planning. I believe in being pragmatic: solving the problems I know are likely to pop up, and not wasting time trying to account for every theoretical issue. At some point you've got to stop analyzing and start coding.
Now that the unthinkable has happened, however, I'm forced to reassess my position. At what point do we stop categorizing these events as unlikely catastrophes we don't need to worry about and start thinking of them as potential issues we should be accounting for? Should we just assume that anything can crash at any time? How much planning is too much?
What do you think, readers? How do you deal with these kinds of situations? Let me know in the comments!