The Gresham’s law of Amazon Web Services · Fazal Majid's low-intensity blog

In the bad (good?) old days when currency’s worth was established by the amount of gold or silver in coinage, kings would cut corners by debasing currency with lead, which is almost as dense as gold or silver. In the New World, counterfeiters debased gold coins with platinum, which was first smelted by pre-columbian civilizations. Needless to say, the fakes are now worth more than the originals.

The public was not fooled, however, and found ways to test coins for purity, including folkloric ones like biting a coin to see if it is made of malleable gold, rather than harder metals. People would then hoard pure gold coins, and try to rid themselves of debased coins at the earliest opportunity. This led to Gresham’s Law: bad money drives out good money in circulation.

After a year of using Amazon Web Services’ EC2 service at scale for my company (we moved to our own servers at the end of 2011), I conjecture there is a Gresham’s Law of Amazon EC2 instances – bad instances drive out good ones. Let me elaborate:

Amazon EC2 is a good way to launch a service for a startup, without incurring heavy capital expenditures when getting started and prior to securing funding. Unfortunately, EC2 is not a quality service. Instances are unreliable (we used over 80 instances at Amazon, and there was at least one instance failure a week, and sometimes up to 4). Amazon instances have poor disk I/O performance that makes them particularly unsuitable to hosting non-trivial databases (EBS is even worse, and notoriously unreliable).

Performance is also inconsistent—I routinely observed “runt” m1.large instances that performed half as well as the others. We experienced all sorts of failure modes, including disk corruptions, disks that would block forever without timing out, sporadic losses of network connectivity, and many more. Even more puzzling, I would get 50% to 70% failure rate on new instances that would not come up cleanly after being launched.

Some of this is probably due to the fact we use an uncommon OS, OpenSolaris, that is barely supported on EC2, but I suspect a big part of this is that Amazon uses low-end commodity parts, and does not proactively retire failed or flaky hardware from service. Instances that have the bad luck of being assigned to flaky hardware are more likely to fail or perform poorly, and thus more likely to be be destroyed, released and a new one reassigned in the same slot. The inevitable consequence of this is that new instances have a higher likelihood of being runts or otherwise defective than long-running ones.

One work-around is to spin up a large number of instances, test them, and destroy the poor-performing ones. AWS runts are usually correlated with slower CPU clock speeds, as older machines would be running older versions of the Xen hypervisor Amazon uses under the hood, have less cache, slower drives and so on. Iterating through virtual machines as if you are picking melons at a supermarket is a slow and painful job, however, and even their newer machines have their share of runts. We were trying to keep only machines with 2.6 or 2.66GHz processors, but more than 70% of the instances we were getting assigned were 2.2GHz runts, and it would usually take creating 5 or 6 instances on average to get a non-runt.

In the end, we migrated to our own facility in colo, because Amazon’s costs, reliability and performance were just not acceptable, and we had long passed the threshold beyond which it is cheaper to own than rent (I estimate it at $5,000 to $10,000 per month Amazon spend, depending on your workload). It is not as if other cloud providers are any better—before Amazon we had started on Joyent, which supports OpenSolaris natively, and their MTBF was in the order of 2 weeks, apparently because they replaced their original Sun hardware with substandard Dell servers and had issues with power management C-states in the Dell server BIOS.

The dirty secret of cloud services is that there is no reliable source of information on actual performance and reliability of cloud services. This brings out another economic concept, George Akerlof’s famous paper on the market for lemons. In a market where information asymmetry exists, the market will eventually collapse in the absence of guarantees. Until Amazon and others offer SLAs with teeth, you should remain skeptical about their ability to deliver on their promises.