Thursday 1 October 2009

A river runs through it

Hopefully by now you are becoming acclimatised to some of the best practices for designing large-scale systems. Now that we've examined the "application layer" in a bit of detail (implement only Atomic, Idempotent, Stateless operations), we'll move on to look at the "back-end" - the data storage / state / persistence layer.

Fundamentally, all scalable architectures are about replication, be it caching, database replication, or clustering. They are about mitigating the problems of having one central authoritative system, which simply does not scale: as the number of users and transactions grows, so does contention on that central resource. The only way to deal with this is to add replicas: replica applications, databases, machines, graphics, whatever your application is built of.

Way back in the "DotCom Boom (tm)" days of the internet, Eric Brewer, chief scientist of a company called Inktomi (which ran Yahoo!'s search engine before a young upstart called Google got their big break by displacing them), came up with an idea known as "Brewer's Conjecture", or sometimes the "CAP theorem". It basically says that in a system that uses distributed data, you can choose at most two of the following (never all three):

Consistency, Availability, or Partition-tolerance

Having all three is impossible, because there is always a delay while a replica is created. To deal with this delay, you can either lock access to each copy until all are in sync, or put up with the fact that the replicas may be out of sync with each other during the replication process. One of the most interesting demonstrations of this was what used to be known as the "Google Dance": when Google pushed out their monthly index updates, search results across the globe would all go out of sync (nowadays they use much more frequent updates, and a lot more in the way of personalisation technologies, so you can't really see this effect any more).
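
To make the trade-off a bit more concrete, here's a toy sketch in Python (purely my own illustration, nothing to do with any real product) of a replicated store that can favour either consistency or availability when it writes:

import threading
import time

class Replica:
    """A toy replica: just a dictionary plus a simulated network delay."""
    def __init__(self, delay=0.0, reachable=True):
        self.data = {}
        self.delay = delay
        self.reachable = reachable

    def write(self, key, value):
        if not self.reachable:
            raise ConnectionError("replica unreachable (network partition)")
        time.sleep(self.delay)              # simulated replication latency
        self.data[key] = value

class ReplicatedStore:
    """Toy store illustrating the CAP trade-off; not a real database."""
    def __init__(self, replicas):
        self.replicas = replicas

    def write_consistent(self, key, value):
        # Consistency first: the write only succeeds if every replica accepts
        # it, so one unreachable replica makes the system unavailable for writes.
        for r in self.replicas:
            r.write(key, value)

    def write_available(self, key, value):
        # Availability first: acknowledge straight away and replicate in the
        # background, accepting that reads may see stale data for a while.
        def replicate(replica):
            try:
                replica.write(key, value)
            except ConnectionError:
                pass                        # a real system would queue and retry
        for r in self.replicas:
            threading.Thread(target=replicate, args=(r,)).start()

store = ReplicatedStore([Replica(), Replica(delay=0.5), Replica(reachable=False)])
# store.write_consistent("user:42", "new profile") would raise ConnectionError here
store.write_available("user:42", "new profile")     # returns immediately
time.sleep(0.1)
print([r.data.get("user:42") for r in store.replicas])   # slow and partitioned replicas still stale

The consistent path refuses to complete while any replica is unreachable; the available path answers immediately and lets the replicas catch up, and that catch-up period is exactly the window of inconsistency the Google Dance made visible.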

At the compute / application layer, problems of state management are largely solved by adopting the styles of programming discussed previously in this blog (Implement only Atomic, Idempotent, Stateless operations).

However, the most common back-end state management system for web applications is the Relational Database Management System, or RDBMS; usually MySQL, Oracle, or MS SQL Server. These databases provide partition tolerance (to a degree) and a consistent view of data, so logically they must be sacrificing availability.

In his most famous blog post, Werner Vogels, Chief Technology Officer of Amazon, takes the view (and I agree) that not only should we sacrifice consistency for the sake of Availability and Partition-tolerance, but that this is the only reasonable option as a system scales (as the uptime of a whole system is the product of the uptimes of all its component parts).
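
To put numbers on that parenthesis, here's a quick back-of-the-envelope sketch (the 99.9% figure is just an illustration):

# Suppose a request has to pass through five components in series,
# each available 99.9% of the time.
availabilities = [0.999] * 5

system = 1.0
for a in availabilities:
    system *= a             # the request fails if any one component is down

print(round(system, 4))     # 0.995, i.e. roughly 44 hours of downtime a year

Every extra component you chain in series drags the whole system's availability down, which is why Vogels is so keen to protect Availability and Partition-tolerance.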

Other people have different views. StackOverflow, a popular programming community site, have chosen to sacrifice availability. You often (in my case a lot! maybe I'm breaking it??) see the message "StackOverflow is currently offline for maintenance. Routine maintenance usually takes less than an hour. If this turns in to an extended outage, we will post details on the blog." This may cause them some interesting challenges as they grow, but certainly at the moment, all things considered, they will not necessarily lose any cold hard cash if their site is down for 20 minutes, whereas Amazon's losses would be massive.

Buried deep in the small print of Amazon's SimpleDB service, the documentation for the PutAttributes method has the following note:

"Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate GetAttributes or Query request (read) immediately after a DeleteAttributes or PutAttributes request (write) might not return the updated data."

Windows Azure's Table Storage shares this characteristic, as do both Amazon's and Azure's queue services, which consequently cannot guarantee message ordering (FIFO).
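
If your consumer does care about ordering, you have to impose it yourself. One common pattern, sketched below with made-up message fields, is to have the producer stamp each message with a sequence number and have the consumer drop anything it has already seen or superseded:

def apply_updates(messages, handler):
    """Tolerate out-of-order and duplicate delivery.

    Each message is assumed (by us, not by the queue) to carry a
    producer-assigned 'seq' number per entity; anything at or below the
    last sequence already applied for that entity is simply dropped."""
    last_applied = {}                       # entity id -> highest seq applied
    for msg in messages:
        entity, seq = msg["entity"], msg["seq"]
        if seq <= last_applied.get(entity, -1):
            continue                        # stale or duplicate: ignore it
        handler(msg)                        # the handler itself should be idempotent
        last_applied[entity] = seq

# Example: messages arriving out of order from the queue
events = [
    {"entity": "order-1", "seq": 2, "status": "shipped"},
    {"entity": "order-1", "seq": 1, "status": "paid"},      # late arrival, dropped
    {"entity": "order-1", "seq": 2, "status": "shipped"},   # duplicate, dropped
]
apply_updates(events, lambda m: print(m["entity"], "->", m["status"]))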

This model of "Eventual Consistency" is also one of the core characteristics of the new breed of NoSQL databases appearing. And whilst this makes a lot of sense to many web developers who have spent their lives dealing with stateless cached data, it's probably the single most confusing part of distributed application development for a traditional RDBMS or client/server developer.

The very idea! Eugh! Yuk! You mean you can update data only for the system to "sometimes give you the wrong result"? This makes RDBMS-centric developers feel very nauseous. But fear not! Eventual consistency is part of daily life. As I've said in a previous post, it's odd that the canonical example of the transaction is the bank transaction, when in fact this process operates in an eventually consistent manner (and is actually far more complicated than a transaction).

We've been indoctrinated by the Big RDBMS vendors that only transactional write consistency will do. They commonly use the banking metaphor to imbue their products with an implied stable, trustworthy state that is simply not there.

People do not need to know exactly when someone else does something, only what they've done by the time they come to deal with it. The mother of all transactional systems, the stock exchange, is eventually consistent. However, a LOT of money is spent to keep the inconsistency window as short as possible, for example by locating trading offices as close as possible to the exchanges, with high-speed connections.

There is absolutely no doubt that dealing with an eventually consistent store is much more complicated than dealing with a central RDBMS, but as multi-tenant systems become the norm over the next 5-10 years, the RDBMS will look increasingly creaky as a core on which to build an application.

Over the next couple of posts I'll look in more detail at eventually consistent data stores and queues, and at some of the patterns commonly used to deal with them.