Engineering four 9s

Four 9s is a term thrown around in IT circles by people who like to gloat about all the expensive equipment they have purchased. For TempWorks it means that we guarantee that our hosting services are available for use by our customers 99.99% of the year. Four 9s boils down to about 53 minutes of unplanned downtime for a given year. A footnote to our guarantee is that it does not include planned downtime for system maintenance.
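For anyone who wants to check the arithmetic, here is a minimal sketch of how an availability percentage turns into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration; it is not anything from our monitoring tools.

```python
# Minutes in a non-leap year (assumption: 365 days, leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Return the minutes per year a service may be down at the given
    availability, where availability is a fraction such as 0.9999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the budget for two, three, and four 9s side by side.
    for label, value in [("two 9s", 0.99), ("three 9s", 0.999), ("four 9s", 0.9999)]:
        print(f"{label}: {downtime_budget_minutes(value):.1f} minutes per year")
```

Four 9s comes out just under 53 minutes, while three 9s allows nearly nine hours, which is why that last 9 is the expensive one.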

Personally I am a pessimist, or at least that is what my wife tells me. I prefer to call it being overly cautious: I always expect bad things to happen. Okay, same thing. Over my years here at TempWorks I have overseen our hosting services division grow from three computers to more than 60 servers today. In the beginning we didn’t have the budget to worry about uptime and redundancy. If a computer went down, or the Internet or power fizzled out, our customers went offline. Luckily that didn’t happen too often: maybe a few times a year, mostly because of our building’s notoriously unreliable power supply. We’ll just say I didn’t enjoy the spring and summer storm seasons.

paul_czywczysnki
Taken circa 2002. Back then our entire hosting solution took up about 8U of the rack I am standing next to.

Over the years we have tried different approaches to redundancy and to achieving four 9s. I am not going to document every attempt; I want to keep this post short and to the point. Our efforts can be broken down into three points.

Redundancy:

I get teased quite a bit around the office about buying two of everything. I can take it; it lets me sleep soundly at night. We looked at every failure point and doubled up on it: multiple data centers, multiple Internet connections from separate ISPs, multiple firewalls in a high-availability configuration, multiple RDP/ICA load balancers, multiple web farms for our web-based products, and finally, multiple mirrored SQL Server installations, including our production 2-node failover cluster.
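To show why buying two of everything pays off, here is a rough sketch of the standard availability math. It assumes failures are independent, which real hardware sharing a room and a power feed never quite is, and the 99% figures are made-up inputs, not measurements from our equipment.

```python
def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of a group of identical components where any one
    surviving copy keeps the service up (assumes independent failures)."""
    return 1.0 - (1.0 - single) ** copies


def chain_availability(parts: list[float]) -> float:
    """Availability of components in series, where every one of them
    must be up for the service to work."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


if __name__ == "__main__":
    # Hypothetical chain: ISP link, firewall, and database server,
    # each assumed to be 99% available on its own.
    single_chain = chain_availability([0.99, 0.99, 0.99])
    doubled_chain = chain_availability([redundant_availability(0.99)] * 3)
    print(f"one of each: {single_chain:.4%}")
    print(f"two of each: {doubled_chain:.4%}")
```

The point of the sketch is that a chain is weaker than its weakest link, so every link in the chain has to be doubled before the total gets anywhere near four 9s.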

Power Stability:

As I mentioned before, power for our building was iffy on the best days, and brown-outs were a constant. Several years ago we made the decision to invest some serious capital to fix the situation. We ended up installing a solution from APC that included our own power generator and a direct power feed from the electrical grid. The only thing I need to worry about in a power outage is keeping the generator filled with diesel, and yes, I do have emergency refilling contracts. Now I look forward to storm season. The power goes out for the building, but our data center hums along like nothing happened. I have also distributed a few protected power outlets through our office suite to key offices that need to stay operational, like our payroll processing division.

Internet Stability:

I can never claim that we have Internet redundancy figured out. This issue has been the most troublesome to get right. The solution we have in place now seems sound and has been working well since we brought it online last summer. Currently we have a fiber tap on the Metropolitan Optical Ethernet loop, as does our ISP. That is our main pipe, and it terminates into our Cisco 7200. For redundancy we have two bridged T1s terminated at our Cisco 7200 and at another ISP in a different city. We’re using BGP routing between the two ISPs. Our secondary data center is on a different pipe altogether, and we keep another T1 active at another location just in case we need an emergency pipe to the Internet and all else has failed.

Some of our routers and firewalls
One of two rows of hosting servers and UPS modules. Also me holding an award from APC for spending the most money that year. Actually it was an uptime certification.
The shiny new generator
   

Four 9s is hard and isn’t cheap. It’s pretty much two or more of everything, and it takes about four times the man-hours to get it working right. It’s a never-ending battle to get unanticipated downtime to nil. I am sure as time goes on we will be tweaking our setup. As it is now, it works and works great. We haven’t had an outage in about 9 months, and that downtime was only about 20 minutes. We were still trying to get our BGP routes right :)
