Engineering four 9s

Four 9s is a term thrown around in IT circles by people who like to gloat about all the expensive equipment they have purchased. For TempWorks it means that we guarantee that our hosting services are available for use by our customers 99.99% of the year. Four 9s boils down to about 53 minutes of unplanned downtime for a given year. A footnote to our guarantee is that it does not include planned downtime for system maintenance.
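
For anyone who wants to check the arithmetic, here is a minimal sketch of where the 53 minutes comes from. The function name and the 365-day year are my own assumptions for illustration; the guarantee itself only states the 99.99% figure.

```python
# Assumption: a 365-day year (leap years ignored), so 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent):
    """Minutes of unplanned downtime allowed per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the yearly budget for two, three, four and five 9s.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.1f} minutes/year")
```

Four 9s works out to roughly 52.6 minutes, which is the "about 53 minutes" quoted above. Each extra 9 cuts the budget by a factor of ten.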

Personally I am a pessimist, at least that is what my wife tells me. I like to call it being overly cautious and always expecting bad things to happen. Okay, same thing. Over my years here at TempWorks I have overseen the growth of our hosting services division from three computers to today, where we have over 60 servers. In the beginning we didn’t have the budget to worry about uptime and redundancy. If a computer went down, or the Internet or power fizzled out, our customers went offline. Luckily that didn’t happen too often. Maybe a few times a year, mostly because of our building’s notorious power supply conditions. We’ll just say I didn’t enjoy the coming of the spring/summer storm seasons.

Taken circa 2002, our entire hosting solution then took up about 8Us in that rack I am standing next to.

Through the years we have tried different approaches to redundancy and achieving four 9s. I am not going to document every attempt we made in the past. I want to keep this post short and to the point. Our efforts can be broken down into three bullet points.

Redundancy:

I get teased quite a bit around the office about buying two of everything. I can take it; it lets me sleep soundly at night. We looked at every failure point and doubled up on it. Multiple data centers, multiple Internet connections from separate ISPs, multiple firewalls in a high availability configuration, multiple RDP/ICA load balancers, multiple web farms for our web based products, and finally, multiple mirrored SQL Server installations including our production 2-node fail-over cluster.

Power Stability:

As I mentioned before, power for our building was iffy on the best days; brown-outs were a constant. Several years ago we made the decision to invest some serious capital to fix the situation. We ended up installing a solution from APC that included our own power generator and a direct power feed from the electrical grid. The only thing I need to worry about in a power outage is the ability to keep the generator filled with diesel, and yes, I do have emergency refilling contracts. Now I look forward to storm season. The power goes out for the building but our data center hums along like nothing happened. I have also distributed a few protected power outlets through our office suite to a few key offices that need to stay operational, like our payroll processing division.

Internet Stability:

I can never claim that we have Internet redundancy figured out. This issue has been the most troublesome to get right. The solution we have in place now seems sound and has been working well since we brought it online last summer. Currently we have a fiber tap on the Metropolitan Optical Ethernet loop, as does our ISP. That is our main pipe, and it terminates into our Cisco 7200. For redundancy we have two bridged T1s terminated at our Cisco 7200 and at another ISP in a different city. We’re using BGP routing between the two ISPs. Our secondary data center is on a different pipe altogether, and we keep another T1 active at another location just in case we need an emergency pipe to the Internet and all else has failed.

Some of our routers and firewalls.
One of two rows of hosting servers and UPS modules. Also me holding an award from APC for spending the most money that year. Actually it was an uptime certification.
The shiny new generator.

Four 9s is hard and isn’t cheap. It’s pretty much two or more of everything and takes about four times the man hours to get it working right. It’s a never-ending battle to get unanticipated downtime to nil. I am sure as time goes on we will be tweaking our setup. As it is now, it works and works great. We haven’t had an outage in about 9 months, and that downtime was only about 20 minutes. We were still trying to get our BGP routes right :)

Enterprise 12r7 Release Notes

New Features

  • New dynamic searching engine.
  • End-users can create their own custom searches and share them.
  • Enhanced system for managing customer default values and rate sheets.
  • Enhanced Resume Parser engine.
  • Required documents management.
  • Invoice merging enhancements.
  • Evaluation management.
  • Improved test score management.
  • Grids can now be customized on a per user basis.
  • End-users can now manage their own drop-down value lists without the need to call support.
  • Deep integration with Trak-1 Background Screening. No more double data entry and real-time alerts on background report documents.

  • Many bug fixes since Enterprise 12r6…


See the product page here.

WPF and Terminal Services, a mix not made in heaven

When we first architected Enterprise, Terminal Services performance wasn’t at the top of our priority list. Actually, one of our goals was to get rid of the requirement that our software run on Terminal Services in a distributed environment. It sucks to have to pay MS license fees twice: once for the Windows CAL and then for the TS CAL. I spent a lot of time researching a distributed framework to build Enterprise on. I was fortunate to see Rocky Lhotka at a local event and hear about his CSLA framework. It fulfilled all our requirements and after reading his book I was sold. Fast forward a few years and we have a desktop client that can run locally and talk to a remote data store using not much more bandwidth than a similar browser application. Plus we get the benefit of a full trust application on the computer. So yes, we have a distributed application that runs across the Internet with pretty good speed, and we find ourselves still needing Terminal Services sometimes. The one factor I didn’t account for, and shame on me for being in this industry over 15 years and not seeing it, is that companies like to run really old hardware.

We designed Enterprise using the latest and greatest from Microsoft, sometimes called the bleeding edge of technology. For developers that is great because you get to learn about stuff before it becomes mainstream and old school. The problem is that our newfangled WPF UI application has some moderately hefty hardware requirements. Of course, as time goes on we have Moore’s Law and business equipment upgrade cycles. Sometimes, though, they don’t kick in soon enough for our liking. Companies run Terminal Services to get a few more years out of their users’ desktops, or the user doesn’t even have a full desktop and runs a WinTerm. Each company has its own agenda, and we as a company have to adapt our software to run in our market space. As Enterprise rolled out the door and into customers’ hands we began to learn that users sometimes didn’t have the horsepower to run Enterprise and all the WPF goodness we put into it. Now I am not saying you need a $5,000 gaming machine to get decent performance, just something with moderately reasonable specs. With today’s hardware, spending $400 at Best Buy will get you a machine that will run Enterprise pretty well. But when you’re dealing with a company that hasn’t upgraded its user machines in the last 5 to 6 years, you might as well be trying to run Enterprise on an Intel 486. So we have come full circle and have to deal with Terminal Services again.

Late last year Aaron was tasked with redesigning our UI and taking Terminal Services into account. This meant detecting when we’re running on Terminal Services and removing all our transitions. Transitions are a bad thing when it comes to Terminal Services because as we fade and slide UI elements around, Terminal Services needs to repaint the screen on the client end. You then end up with screen tearing and horrible lag because Terminal Services is trying to catch up as you slide that panel neatly out of the way. By getting rid of all the WPF eye candy we end up with a snappy UI. We did all this work and were proud of ourselves until about a month ago. Enterprise was being installed at a customer that was self-hosted and running Terminal Services. At the time we thought it was no big deal because we knew we had Enterprise tuned for Terminal Services sessions. What we didn’t anticipate was that their Terminal Services hardware would be inadequate.
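
To give a rough idea of what "detecting when we’re running on Terminal Services" can look like, here is a minimal sketch in Python rather than our actual WPF code. It asks the Win32 GetSystemMetrics call for the SM_REMOTESESSION value, which is nonzero inside a remote session. The function names, the injectable get_system_metrics parameter and the 250 ms default are assumptions made up for this illustration.

```python
import ctypes
import sys

# Win32 GetSystemMetrics index that reports whether the calling process
# is running in a Terminal Services (remote) session.
SM_REMOTESESSION = 0x1000


def is_remote_session(get_system_metrics=None):
    """Return True when running inside a remote session.

    get_system_metrics is a hypothetical hook so the check can be
    exercised off Windows; by default the real Win32 call is used.
    """
    if get_system_metrics is None:
        if sys.platform != "win32":
            return False  # no Terminal Services outside Windows
        get_system_metrics = ctypes.windll.user32.GetSystemMetrics
    return get_system_metrics(SM_REMOTESESSION) != 0


def transition_duration_ms(default_ms, remote):
    """Hypothetical helper: skip fades and slides in a remote session."""
    return 0 if remote else default_ms


duration = transition_duration_ms(250, is_remote_session())
print(f"transition duration: {duration} ms")
```

The point of the design is to decide once at startup and let every animation read the result, so the UI simply snaps into place over a remote session instead of fading and sliding.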

Back in the Access days of TempWorks we would recommend a two-CPU machine with at least 2 GB of RAM. That was usually enough to host about 25 users. When we installed Enterprise on a machine with this config we could only get 5 to 10 users on comfortably. After some pondering I came to the conclusion that we were CPU bound. Well, that was easy to figure out because the CPUs maxed out at 100% utilization when we approached 10 users; the question was why. Normally Enterprise doesn’t use many more CPU cycles than any other business application. We do steal a few more than necessary when we do our fancy WPF transitions, but nothing too off the scale. But in Terminal Services all those transitions are turned off, so that wasn’t it. We came to the conclusion that it was a combination of issues. First of all, Enterprise is very multithreaded. Almost everything we do is fired off in its own thread so the UI doesn’t freeze. Data coming and going is threaded off. That is not a big deal when you’re the only person on a dual or quad core machine, but it is a big deal when you’re sharing an old single core CPU server with many users. Compounding this, WPF uses software (CPU) rendering because you don’t have access to a GPU in a Terminal Services session. So the CPUs were maxed out trying to keep up with all the threads while also trying to draw Enterprise.

To test the theory we went out and purchased a new Mac Pro desktop with 2 Quad Core Nehalem Xeons at 2.2 GHz. Since Nehalems are Hyper-Threaded we had 16 logical cores at our disposal. We also installed 10 GB of RAM to make sure we didn’t run out of memory and threw on Windows Server 2008 R2 beta running Terminal Services. Yes I know, a Mac running Windows. Anyway, to test I filled our training room with volunteers and had a few developers at their desktops all log in to this machine and start using Enterprise. I asked all the users to search, open records and navigate to different forms. I was relieved as I watched the server’s resource monitor. The server hardly broke a sweat. We had about 30 users logged in and 50 instances of Enterprise actively being used, and the CPUs never broke 50% utilization. Better yet, the CPU frequency hovered around 75%, meaning the CPUs weren’t even at full power. Not bad for a $3,500 desktop.

I can rest easy again knowing Enterprise runs very well on Terminal Services as long as you have current hardware. To me that is fine because it is easier to convince a company to upgrade their few servers as opposed to upgrading dozens and dozens of desktops.

-Paul