Friday, 7 August 2009

Warning: This internet may contain traces of nuts

To summarise my last few posts, you should now have at least a basic understanding of how the web is put together (DNS, TCP/IP, HTTP, HTML) and the problems you are likely to encounter when writing code for distributed systems. You should also have an idea of the major architectural constraints to consider when creating robust code in this type of environment....

  • Atomicity - Write atomic operations that pass or fail in their entirety.
  • Statelessness - Maintain a high degree of disposability
  • Concurrency - Avoid using locks wherever possible.
  • Idempotence - Write code that is repeatable and has no side effects

Great. We can now write a working distributed application. However, while this application may be effective, it may not be the most responsive. All this data still has to move all over some very s_l_o_w wires. In all distributed applications, there is a rule of thumb that you should remember. Data is always more expensive to move than it is to store. You only have to look at mobile phone operators data tariffs to see that.

So, how do we stop moving stuff around unnecessarily? Well, the answer is actually modelled directly on the world of squirrels. Well maybe not directly. Both squirrels and distributed applications use caches for storing stuff around their domain until it's needed.

In practical terms most caches have interfaces like a hash table, so you effectively have a big bucket in which you can store objects and identify them using a key. You can add an item to a cache by calling PUT(key, objectToCache) and retrieve items by calling GET(key)

Most caches are of limited size, so we have to decide how to throw things away if the cache is full and we want to add new items. This is normally done by popularity. So those items "hit" the most stay in the cache, and less popular items are thrown away.

Also in common with squirrels, we have an issue with staleness. We don't want to be retrieving nuts only to find that they've gone all mouldy. Yuk! So most caches also support some sort of Data Expiration strategy. These can be defined in a number of ways, though normally they tend to support sliding and absolute expiry policies. The former is where you specify after what period of inactivity you want cached items to expire. i.e. "If this isn't used for 5 minutes throw it away". The latter, absolute expiry is just the same as a use-by date..."this data is no good after midnight on wednesday."

Now an important thing about caches, is that you should not make any assumptions about their consistency. They could have been pillaged by some other hungry squirrel in the middle of the night without your knowledge.

As the cache makes decisions about how to optimise its use, it is quite possible that you can call PUT(itemA, myItem) and then call GET(itemA) and you get NULL back.

In technical jargon...
If you call GET(itemA) and it returns myItem, this is called a "hit"
If you call GET(itemA) and it returns NULL this is called a "miss".

Caches come in all shapes and sizes:
In-Memory Caches, Distributed Caches, Proxy caches, and Large Area Caching Networks , but they all work in more or less the same way.

So how do we make best use of a cache? Essentially, it's all to do with understanding the cost of a miss. You take the slowest operations.

There are effectively 4 caching patterns that you see on a day to day basis: Read Through, Write Through, Read Ahead and Write Behind.

For reading, it's all about how you handle a cache miss. Using the Read Through pattern in the case of a cache miss, you query your data source, populate the cache, and continue. Next time you get a hit.

With Read Ahead you should never get a miss. You query your data source and populate the cache before any requests are expected. But you may be holding things in memory that are not required.

Populating a cache on a write is more unusual outside distributed applications, but can again boost performance given the right set of conditions. Using Write Through caching, whenever a user submits data, this data is cached as it is written to the data source, speeding up any following reads without having to query the data source again.

And finally, Write Behind. Now, this one is seen particularly often in distributed systems. Most of the "post-relational" databases out there use this pattern, in which the cache is firstly committed to, then some other process writes to the data source at some unspecified later time. This has the advantage of handling writes very very quickly, and leaves the complicated business of reconciling data until later. However this means you have to tolerate a degree of inconsistency across nodes, albeit momentary in many cases.

So what pattern should I be using? Well, as usual there is no straightforward answer. "It all depends".

Read-Ahead gives a speed advantage over Read-Through, but can only be used if you can predict which cache items are going to be needed in the future. If your guess is accurate, refresh-ahead means reduced latency and no added overhead. The poorer your guess is, the worse your performance will be (as more unnecessary requests are sent to your data source). This could even result in worse overall performance if your data source falls behind with its request processing backlog. Read Ahead caches are also more expensive to initialise.

If Write-Behind caching can be used, it can allow much lower latency and higher throughput than Write-Through caching. Additionally Write-Behind caching can sometimes improve the performance of the data source (as there are less writes, and they can be delayed until resources are available).

So, caching can be used to improve the efficiency of distributed systems. You often see caching used like some kind of silver bullet for speed, but like any form of optimisation, it should be viewed as such. The trade off for speed is use of more resources (and management of those resources). More components mean more complexity, but applied carefully, caching can greatly improve the responsiveness of an application, especially a distributed one.

No comments:

Post a Comment