When a fundamental shift happens in the world, it usually has many interrelated causes. It is nearly impossible to determine a single root cause, and even harder to get agreement on one. Consider the competing creation mythologies around the invention and inventors of just about anything important.

What is usually a lot easier to see is the event that marks the turn of a trend from fringe curiosity or toy to obvious inevitability.

Back in 1997, I was working for a startup writing serious software, so of course we had Sun SPARCs running Solaris, Alphas running Tru64, as you did. I was running Linux at home, as you did (I still do). And despite knowing the incontrovertible fact that Solaris and Tru64 on serious RISC CPUs were how serious software was written and deployed, it was also getting pretty obvious that Linux on commodity PCs could do the same sorts of things. Maybe it was a bit slower, but it was getting faster more quickly.

Eventually, the only differences (besides seriousness) were that the big iron could have 64 CPUs in a single system (this used to be an astounding number), and it had more RAS (Reliability, Availability and Serviceability) features.

But maybe you could get all those CPUs (and more) by putting together a bunch of PCs? And so people started building systems with clusters of PCs running Linux, giving them cute names like “Beowulf Clusters”.

Some kids at Stanford even named their cluster-hosted application “Backrub”!


But still, if you were running a serious e-commerce site, you needed scalable gear with serious RAS features from a big vendor. You probably ran Oracle on it, and it was the center of the universe.


Let’s fast-forward to the year 1999. You’re a little anxious about Y2K, but things are going great. Your big iron is humming along, normal people are getting comfortable buying and selling stuff online, and your stock price is soaring.

And then your pager goes off. The impossible has happened: the big iron is down! Your site is all over the front pages again, but not all press is good press. Your vendor has their engineers swarming your datacenter trying to get you back up, but the downtime persists.

Since the systems are proprietary, you're totally reliant on the vendor for help, and your NDA with them prevents you from speaking publicly. In the resulting vacuum, people speculate about a cache-controller bug in the CPU or an OS bug. Other rumors suggest sysadmin error or a failure of capacity planning. None of that matters. What does matter is that the high-end RAS features you paid so much for have not saved you from your SPOF (Single Point of Failure). Your stock price is no longer soaring.

After 22 hours of downtime, you're finally back up. For now. This scenario will play out dozens of times over the next year or two. And many of your friends at similar companies with similar architectures have similar stories.

Clearly you have to do something. Vendor promises aside, even the most expensive, most reliable hardware still fails eventually. And even the most scalable single servers only scale as far as the vendor had the imagination to design them to. So you need more than one server. You need cluster-aware applications that can survive individual servers failing, and can add capacity simply by adding more servers. And once you have all that, why are you paying 20x for servers that are at best 2-3x more reliable? Looks like the kids building their toys with funny names had the right idea all along.
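The idea above can be sketched in a few lines of code (a toy illustration only; the server names, the failure set, and the `request_with_failover` helper are all invented for this example):

```python
import random

# Hypothetical cluster of interchangeable commodity servers.
# Any one of them may fail; the client simply retries on another.
SERVERS = ["web1", "web2", "web3", "web4"]
FAILED = {"web2"}  # simulate one box being down

def handle(server, request):
    """Pretend to serve a request; raise if the box is down."""
    if server in FAILED:
        raise ConnectionError(f"{server} is down")
    return f"{server} served {request}"

def request_with_failover(request):
    """Try servers in random order until one succeeds.

    No single point of failure: the site stays up as long as any
    one server is healthy, and adding capacity is just appending
    another cheap box to SERVERS.
    """
    for server in random.sample(SERVERS, len(SERVERS)):
        try:
            return handle(server, request)
        except ConnectionError:
            continue  # that box failed; try the next one
    raise RuntimeError("all servers down")
```

The point of the sketch is the economics: once the application tolerates individual failures, each server only has to be cheap, not bulletproof.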

And you’re not the only one making the shift. Most of the big Wall Street companies are doing the same, in part because they saw what happened to you!

Now here we are in 2013. The TOP500 list of supercomputers is dominated by big clusters of commodity servers running Linux. Finding Sun, Digital or HP big iron systems in a modern Internet datacenter is about as likely as finding a System/360 mainframe.

What you will still find is expensive, highly proprietary, big-iron networking equipment, with all the costs and disadvantages that brings.


The economics of buying and managing equipment demand that this change as datacenters continue to grow at a blistering pace.

What event will we look back on as catalyzing the transition from big-vendor proprietary switches to flexible, open, Linux-based switches on commodity hardware? I don’t know yet, but it sure is exciting to be in the middle of it!