Thursday 23 July 2009

For Sale: One slightly worn RDBMS

Web and cloud apps have to be able to scale as close to linearly as possible, in a hostile environment. Network topology and number of users can fluctuate wildly. Contrast this with a desktop application, which just has to handle a single user. Or with a client/server app, which may have to handle a couple of hundred users usually on the same network. If millions of users start using our distributed system, the experience of each should be no different from what it would be if only 5 or 6 people were using it.

While web developers are used to dealing with concurrency, either consciously or just by knowing that calling Application.Lock() from VBScript will cripple their app, it is also becoming a wider issue for desktop developers, as chip manufacturers squeeze more and more cores on to their chips.

The arch-enemy of concurrency is shared data. Any data that two people have to update can cause concurrency problems. This is actually made slightly simpler in web programming by virtue of HTTP being stateless, and is one reason why the functional style of programming as seen on the web is becoming more popular in other applications.

If we can maintain the qualities of statelessness and atomicity in our operations, we are already well on our way to make our code safe to run concurrently. It's only when someone attempts to force a stateful model that you run into problems.

If shared data is the enemy of concurrency, locks are its atomic bombs. Locks work by serialising access to shared data, so that only one process / thread / fibre can write shared data at any one time. This means that anyone else just has to wait in line. Serialisation is the arch enemy of scalability.

So, NEVER hold editable shared data in your application layers. I will of course break this absolute rule in a future post for grandstanding purposes, but I'm allowed to - It's my blog! And of course, it's to illustrate a good point.

This issue doesn't tend to be a problem for those accustomed to web programming, since most web frameworks either enforce, or heavily imply, avoidance of global state. In most real systems today, shared state is handled by a backend RDBMS, such as Oracle or SQL Server, which offer a gigantic improvement over file-based DBs such as Access or Paradox. These do a very good job, at the moment, of handling medium to large scale apps. However, if we scale to the web, where fast global availability is required, then "Houston, we have a problem." We realise all we’ve done is offloaded the problem further down the chain. Because these client-server style databases have transactions, and therefore locks, embedded firmly at the heart of their architecture, they will not scale globally. You can partition them; you put them on ever larger machines, or even clusters of machines; but they will always require a global lock to perform a commit.

So if RDBMS' such as SQL Server don't scale well, then what does?

Whilst still a drop in the ocean in terms of commercially enterprise databases, Databases based on Distributed hash tables, such as CouchDB, Google’s BigTable, Voldemort, AWS SimpleDB and similar, are gaining a lot of traction.

Microsoft’s forthcoming Azure Table Storage falls in to this category too, unlike their slightly clumsy and poorly positioned "old man in a new hat" - SQL Azure Services.

These types of database characteristically emphasise Decentralization, Scalability and Fault Tolerance, with nodes continuously joining, leaving, and failing within the system.

Sound familiar? Well these are all databases designed to work not only on the web, but in accordance with the fundamental architecture of the web. They emphasise availability over transactional consistency which can cause serious headaches for people stuck in an RDBMS world view. (The idea that the database will sometimes return the wrong result is horrifying to them.)

Scalable Transactions are EXTREMELY complex (actually impossible, as previously discussed) in a distributed system. So, support for transactions in Distributed Databases is rare to limited. If you look at Amazon's SimpleDB or Azure Table Store, if you perform a "BatchPut" or "EntityGroupUpdate" respectively, these do succeed or fail as a unit. However, they are subject to limitations well below the expectations of developers who are used to developing against RDBMS systems. Maximum request sizes of 4MB, or number of items in a transaction (25 for SDB), seem almost laughable. However these distributed document style databases are a fundamentally different type of application from the RDMBS.

Unlike SQL Server or Oracle however, they are reliable, cheap, and MASSIVELY scalable. Therefore, while they account for a tiny fraction of the database market at the moment, they do offer a very serious cost-based competitive advantage over large scale installations of RDBMS systems. I will write a future post about lock free data management patterns and post-relational DBs in the future, but once again, for now the key takeaway for concurrency is:

Locks are mindblowingly expensive in a distributed system.

And they get more expensive the more they are used. If you really MUST use them, stand well back and wear very dark sunglasses.

Next and last up: Idempotence

No comments:

Post a Comment