Monday, 24 August 2009

Unleash the beasts!

Following my last post on caching patterns, I was asked a question about how to scale cache repopulation. The question was "In a read through caching scenario... assuming your cached object takes 30 seconds to calculate, and you have 60 processes accessing the cached value every half a second each. This suddenly expires, resulting in 120 misses per second until the value is re-calculated and re-cached successfully, swamping your server with repopulation requests. Right?" Well, erm, yes. Yes it does. Within some of the more nerdy circles, this is often called the Dog Pile effect, and has brought many websites down.

When caching quick-to-create items, this rarely causes problems. However in the case of large data sets that take a while to calculate, this can prove "the straw that breaks the camel’s back".

From Wiktionary: v. to dogpile

  1. To jump on top of someone, usually in a group.
    2003, Nancy Holder, Buffy the Vampire Slayer: Chosen, ISBN 0743487923, page 657:
    A vampire got her around the neck from behind; then more, dogpiling her.
  2. To pile on or overwhelm, such as with criticism or praise.
    2005, Craig Spector, Underground, ISBN 0765306603, page 169:
    But this guy was serious, using online payment services and dogpiling her e-mail box within minutes, requesting expedited shipping.

You shouldn't really have anything in your system taking so long to repopulate. If you do, it's likely to indicate a more fundamental architectural problem, as whatever you’re caching does not have a high degree of disposability.

This is a situation where we have hundreds of workers looking for the same information, so we don’t want to swamp our data source with the requests for that same information.

What we want to do, in theory at least, is use a lock. OMG. I know. I said never use locks, but here we are using them. What gives?

Well, this is really an optimisation for a particularly unusual case. Here, cached items are immutable, so we have no side effects to worry about; the process for repopulating them is just not very efficient. Locks CAN be used in a distributed system, but only very carefully (and normally on a single, dedicated background process). For example, some data reconciliation processes may require locks on data, but by putting these processes on background workers, we try to ensure they do not block user activities.

You also need to be aware of the problems of using locks in your particular environment. (We're OK here as our write operations are idempotent).

In this case we would write a wrapper around the cache, to check on its expiry, and force only a single thread to repopulate the cache.

So in pseudo code...

function GET(key)
{
    object = Cache.Get(key) //try to get the value from the cache
    if (object is null)
    {
        lock //SERIALISE ACCESS
        {
            object = Cache.Get(key) //re-read inside the lock: another thread may have repopulated it while we waited
            if (object is still null)
            {
                object = GetItemToCache() //takes 30 seconds
                PUT(key, object, now + 15 minutes) //expires in 15 minutes' time
            }
        }
    }
    return object
}

function PUT(key, value, expires)
{
    Cache.Insert(key, value, expires)
}
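
To make that concrete, here is a minimal C# sketch of the same wrapper, assuming the ASP.NET cache (HttpRuntime.Cache); the class name, the 15 minute lifetime and the GetItemToCache() placeholder are illustrative rather than any real API.

using System;
using System.Web;
using System.Web.Caching;

// A minimal read-through wrapper with double-checked locking (a sketch, not production code).
// GetItemToCache() stands in for whatever slow operation you are protecting.
public static class SlowItemCache
{
    private static readonly object Padlock = new object();

    public static object Get(string key)
    {
        object item = HttpRuntime.Cache.Get(key);   // first, an optimistic read
        if (item == null)
        {
            lock (Padlock)                          // SERIALISE ACCESS on a miss
            {
                item = HttpRuntime.Cache.Get(key);  // re-read: another thread may have repopulated while we waited
                if (item == null)
                {
                    item = GetItemToCache();        // the slow (~30 second) operation
                    Put(key, item, DateTime.UtcNow.AddMinutes(15));
                }
            }
        }
        return item;
    }

    private static void Put(string key, object value, DateTime expires)
    {
        HttpRuntime.Cache.Insert(key, value, null, expires, Cache.NoSlidingExpiration);
    }

    private static object GetItemToCache()
    {
        return "expensive result";                  // placeholder for the expensive query or calculation
    }
}
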
Now you still have processes queuing up at the point where access is serialised, so during the 30 seconds required to repopulate the cache you have requests arriving, and blocking, at 120 per second. This solves one problem but creates another.

We've offloaded the traffic from GetItemToCache(), but forced everyone to wait. So better, but still not great. How can we improve this and reduce the blocking even further? Well... we can temporarily re-cache the "expired" value!

Here, we know that our cached object takes (about) 30 seconds to repopulate, so we can estimate when the item will go "stale": roughly 30 seconds before it is due to expire. When the data goes stale, we immediately re-cache the old value and kick off the repopulation, which allows users to continue to read slightly stale data whilst the update is prepared, avoiding contention on the object loading process.

.NET's cache, for example, allows you to specify a callback which runs when an item falls out of the cache. Most caching platforms have a similar mechanism for handling dropped cache items.

Using ASP.NET expiry, callback functions are delegates that support the following signature:

void CacheItemRemoved(string key, object val, CacheItemRemovedReason reason);
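
For example, registering such a callback when inserting an item might look something like this sketch; the key, the expiry and the handler body are illustrative.

using System;
using System.Web;
using System.Web.Caching;

public static class StaleAwareCache
{
    public static void Put(string key, object value, DateTime expires)
    {
        // Ask ASP.NET to call MarkAsStale just before the item is dropped from the cache.
        HttpRuntime.Cache.Insert(
            key,
            value,
            null,                       // no cache dependency
            expires,                    // absolute expiry
            Cache.NoSlidingExpiration,
            CacheItemPriority.Default,
            MarkAsStale);               // removal callback
    }

    private static void MarkAsStale(string key, object val, CacheItemRemovedReason reason)
    {
        // see the MARK_AS_STALE pseudocode below
    }
}
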

So our newly refactored pseudocode is...

constant ItemKey = "Item_That_Goes_Stale" //a key to identify the cached item

function GET()
{
    object = Cache.Get(ItemKey) //try to get the item from the cache.
    if (object is null)
    {
        lock //if we have a miss, we serialise access to repopulate the cache here.
        {
            object = Cache.Get(ItemKey) //double check after taking the lock (another thread could have jumped in).
            if (object is still null)
            {
                object = POPULATE_CACHE() //it's not in the cache... cache it, and keep hold of the value.
            }
        }
    }
    return object
}

function PUT(value, expires)
{
    //Here we register a callback so we can retain the value in the cache in the case of expiry.
    //ASP.NET will call MARK_AS_STALE just before the item is dropped from the cache.
    Cache.Insert(ItemKey, value, expires - 30 seconds, MARK_AS_STALE)
}

function MARK_AS_STALE(string key, object val, CacheItemRemovedReason reason)
{
    if (reason is Removed or Underused) then return
    //do not re-cache the item if it was explicitly removed (written over) or evicted as underused.

    //the item is stale... re-cache the old value straight away, so anyone calling GET in the
    //next 30 seconds gets the old value instead of blocking.
    Cache.Insert(ItemKey, val) //this value will be replaced when the following POPULATE_CACHE call completes.
    POPULATE_CACHE() //REPOPULATE THE CACHE
}

function POPULATE_CACHE()
{
    object = GetItemToCache() //takes 30 seconds
    PUT(object, now + 15 minutes) //expires in 15 minutes' time.
    return object
}

If you are not using a platform which supports this type of callback pattern, you can use the cache itself to check for staleness, by caching a "still fresh" flag alongside an object, with the flag expiring shortly before the real item. You can then check the cache for both the real item and the "still fresh" flag. If the "still fresh" flag is not in the cache, we know the real item is stale (or about to be). This does require a double read of the cache.
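
As a rough sketch of that approach (again assuming the ASP.NET cache, with made-up key names and timings), one caller spots the missing flag and repopulates in the background while everyone else keeps reading the old value:

using System;
using System.Threading;
using System.Web;
using System.Web.Caching;

public static class StillFreshCache
{
    private const string ItemKey = "Item_That_Goes_Stale";
    private const string FreshKey = "Item_That_Goes_Stale:still_fresh";
    private static int _repopulating; // 0 = idle, 1 = a repopulation is already in flight

    public static object Get()
    {
        object item = HttpRuntime.Cache.Get(ItemKey);
        bool stillFresh = HttpRuntime.Cache.Get(FreshKey) != null; // the double read

        // If the flag has expired, one caller kicks off repopulation in the background;
        // everyone else carries on reading the (possibly stale) value.
        if (!stillFresh && Interlocked.CompareExchange(ref _repopulating, 1, 0) == 0)
        {
            ThreadPool.QueueUserWorkItem(_ =>
            {
                try { Populate(); }
                finally { Interlocked.Exchange(ref _repopulating, 0); }
            });
        }
        return item; // may be the stale value, or null on a completely cold start
    }

    private static void Populate()
    {
        object item = GetItemToCache(); // the slow (~30 second) operation
        DateTime now = DateTime.UtcNow;
        // the real item outlives the flag by comfortably more than the repopulation time
        HttpRuntime.Cache.Insert(ItemKey, item, null, now.AddMinutes(20), Cache.NoSlidingExpiration);
        HttpRuntime.Cache.Insert(FreshKey, true, null, now.AddMinutes(15), Cache.NoSlidingExpiration);
    }

    private static object GetItemToCache()
    {
        return "expensive result"; // placeholder
    }
}
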

Either way, your application should remain much more responsive and stable if this technique is employed to freshen the cache ahead of time. Again, I must say this is an optimisation that is hopefully not needed too often. If you’ve designed your systems around distributed architectural principles, you should never need to do this. The vast majority of the time, the standard read/write, through/behind caching techniques should be more than enough to achieve acceptable system responsiveness.

The widespread use of caching as a central principle in large scale web applications, together with the new post-relational NoSQL databases (used in the likes of Amazon's web services platform and Microsoft's Azure queues), leads to some interesting and common problems with data consistency, which I will start to look at in my next post.

Friday, 7 August 2009

Warning: This internet may contain traces of nuts

To summarise my last few posts, you should now have at least a basic understanding of how the web is put together (DNS, TCP/IP, HTTP, HTML) and the problems you are likely to encounter when writing code for distributed systems. You should also have an idea of the major architectural constraints to consider when creating robust code in this type of environment....

  • Atomicity - Write atomic operations that pass or fail in their entirety.
  • Statelessness - Maintain a high degree of disposability.
  • Concurrency - Avoid using locks wherever possible.
  • Idempotence - Write code that is repeatable and has no side effects.

Great. We can now write a working distributed application. However, while this application may be effective, it may not be the most responsive. All this data still has to move all over some very s_l_o_w wires. In all distributed applications, there is a rule of thumb that you should remember. Data is always more expensive to move than it is to store. You only have to look at mobile phone operators' data tariffs to see that.

So, how do we stop moving stuff around unnecessarily? Well, the answer is actually modelled directly on the world of squirrels. Well maybe not directly. Both squirrels and distributed applications use caches for storing stuff around their domain until it's needed.

In practical terms most caches have interfaces like a hash table, so you effectively have a big bucket in which you can store objects and identify them using a key. You can add an item to a cache by calling PUT(key, objectToCache) and retrieve items by calling GET(key).

Most caches are of limited size, so we have to decide how to throw things away if the cache is full and we want to add new items. This is normally done by popularity. So those items "hit" the most stay in the cache, and less popular items are thrown away.
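
As a toy illustration of that idea (real caches use far cleverer bookkeeping), a popularity-based cache might look something like this; the class and its capacity handling are made up purely to show the eviction decision:

using System.Collections.Generic;
using System.Linq;

// When the cache is full, the least-hit item is thrown away to make room for the new one.
public class PopularityCache
{
    private readonly int _capacity;
    private readonly Dictionary<string, object> _items = new Dictionary<string, object>();
    private readonly Dictionary<string, int> _hits = new Dictionary<string, int>();

    public PopularityCache(int capacity) { _capacity = capacity; }

    public void Put(string key, object value)
    {
        if (!_items.ContainsKey(key) && _items.Count >= _capacity)
        {
            string leastPopular = _hits.OrderBy(kv => kv.Value).First().Key; // fewest hits
            _items.Remove(leastPopular);
            _hits.Remove(leastPopular);
        }
        _items[key] = value;
        _hits[key] = 0;
    }

    public object Get(string key)
    {
        object value;
        if (_items.TryGetValue(key, out value))
        {
            _hits[key]++;   // a hit makes the item more popular, so less likely to be evicted
            return value;
        }
        return null;        // a miss
    }
}
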

Also in common with squirrels, we have an issue with staleness. We don't want to be retrieving nuts only to find that they've gone all mouldy. Yuk! So most caches also support some sort of data expiration strategy. These can be defined in a number of ways, though normally they tend to support sliding and absolute expiry policies. The former is where you specify after what period of inactivity you want cached items to expire, i.e. "if this isn't used for 5 minutes, throw it away". The latter, absolute expiry, is just the same as a use-by date: "this data is no good after midnight on Wednesday."
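
With the ASP.NET cache, for example, the two policies are specified like this (the keys and timings are purely illustrative):

using System;
using System.Web;
using System.Web.Caching;

public static class ExpiryExamples
{
    public static void CacheWithSlidingExpiry(object results)
    {
        // Sliding expiry: "if this isn't used for 5 minutes, throw it away."
        HttpRuntime.Cache.Insert("recent_search", results, null,
            Cache.NoAbsoluteExpiration, TimeSpan.FromMinutes(5));
    }

    public static void CacheWithAbsoluteExpiry(object leagueTable, DateTime wednesdayMidnight)
    {
        // Absolute expiry: "this data is no good after midnight on Wednesday."
        HttpRuntime.Cache.Insert("league_table", leagueTable, null,
            wednesdayMidnight, Cache.NoSlidingExpiration);
    }
}
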

Now an important thing about caches, is that you should not make any assumptions about their consistency. They could have been pillaged by some other hungry squirrel in the middle of the night without your knowledge.

As the cache makes decisions about how to optimise its use, it is quite possible that you can call PUT(itemA, myItem) and then call GET(itemA) and you get NULL back.

In technical jargon...
If you call GET(itemA) and it returns myItem, this is called a "hit"
If you call GET(itemA) and it returns NULL this is called a "miss".

Caches come in all shapes and sizes:
in-memory caches, distributed caches, proxy caches, and large area caching networks, but they all work in more or less the same way.

So how do we make best use of a cache? Essentially, it's all to do with understanding the cost of a miss: you cache the results of your slowest operations, where having to fetch the data again hurts the most.

There are effectively 4 caching patterns that you see on a day to day basis: Read Through, Write Through, Read Ahead and Write Behind.

For reading, it's all about how you handle a cache miss. Using the Read Through pattern, in the case of a cache miss you query your data source, populate the cache, and continue. Next time round you get a hit.

With Read Ahead you should never get a miss. You query your data source and populate the cache before any requests are expected. But you may be holding things in memory that are not required.
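
A minimal Read Ahead sketch might just repopulate the cache on a timer, ahead of demand; the key, the 10 minute interval and QueryDataSource() are all illustrative:

using System;
using System.Threading;
using System.Web;
using System.Web.Caching;

// The cache is (re)populated on a schedule, before anyone asks, so readers should never see a miss.
public class ReadAheadCache
{
    private const string ItemKey = "report_data";
    private readonly Timer _timer; // held in a field so the timer isn't garbage collected

    public ReadAheadCache()
    {
        // populate immediately, then refresh every 10 minutes
        _timer = new Timer(_ => Populate(), null, TimeSpan.Zero, TimeSpan.FromMinutes(10));
    }

    public object Get()
    {
        return HttpRuntime.Cache.Get(ItemKey); // should always be a hit once primed
    }

    private void Populate()
    {
        object item = QueryDataSource();       // the slow query, done ahead of demand
        HttpRuntime.Cache.Insert(ItemKey, item);
    }

    private object QueryDataSource()
    {
        return "report rows";                  // placeholder
    }
}
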

Populating a cache on a write is more unusual outside distributed applications, but can again boost performance given the right set of conditions. Using Write Through caching, whenever a user submits data, this data is cached as it is written to the data source, speeding up any following reads without having to query the data source again.
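
A Write Through sketch, assuming the ASP.NET cache again (SaveToDataSource() and the profile key stand in for the real write and the real data):

using System;
using System.Web;
using System.Web.Caching;

// The cache is updated as part of the write itself, so the very next read is a hit.
public static class WriteThroughProfileStore
{
    public static void Save(string userId, object profile)
    {
        SaveToDataSource(userId, profile);                       // write to the system of record first...
        HttpRuntime.Cache.Insert("profile:" + userId, profile);  // ...then cache it on the way through
    }

    public static object Load(string userId)
    {
        return HttpRuntime.Cache.Get("profile:" + userId);       // reads that follow a write are hits
    }

    private static void SaveToDataSource(string userId, object profile)
    {
        // placeholder for the database write
    }
}
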

And finally, Write Behind. Now, this one is seen particularly often in distributed systems. Most of the "post-relational" databases out there use this pattern, in which the write is committed to the cache first, and some other process writes it to the data source at some unspecified later time. This has the advantage of handling writes very very quickly, and leaves the complicated business of reconciling data until later. However this means you have to tolerate a degree of inconsistency across nodes, albeit momentary in many cases.
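
And a Write Behind sketch: writes land in the cache (and a pending queue) immediately, and a background worker drains the queue to the data source later. The names are illustrative, and a real implementation would need to worry about ordering, batching and failure.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Web;
using System.Web.Caching;

// Expect a window in which the cache and the data source disagree.
public static class WriteBehindStore
{
    private static readonly Queue<KeyValuePair<string, object>> PendingWrites =
        new Queue<KeyValuePair<string, object>>();
    private static readonly object Padlock = new object();

    public static void Save(string key, object value)
    {
        HttpRuntime.Cache.Insert(key, value);   // the fast part: commit to the cache...
        lock (Padlock)
        {
            PendingWrites.Enqueue(new KeyValuePair<string, object>(key, value));
        }                                       // ...and remember that the data source is behind
    }

    // Run this on a background thread or worker process.
    public static void FlushPendingWrites()
    {
        while (true)
        {
            KeyValuePair<string, object> next;
            lock (Padlock)
            {
                if (PendingWrites.Count == 0) break;
                next = PendingWrites.Dequeue();
            }
            WriteToDataSource(next.Key, next.Value); // the slow part happens later, off the hot path
        }
    }

    private static void WriteToDataSource(string key, object value)
    {
        // placeholder for the real (slower) write
    }
}
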

So what pattern should I be using? Well, as usual there is no straightforward answer. "It all depends".

Read-Ahead gives a speed advantage over Read-Through, but can only be used if you can predict which cache items are going to be needed in the future. If your guess is accurate, read-ahead means reduced latency and no added overhead. The poorer your guess, the worse your performance will be (as more unnecessary requests are sent to your data source). This could even result in worse overall performance if your data source falls behind with its request processing backlog. Read Ahead caches are also more expensive to initialise.

If Write-Behind caching can be used, it can allow much lower latency and higher throughput than Write-Through caching. Additionally, Write-Behind caching can sometimes improve the performance of the data source (as there are fewer writes, and they can be delayed until resources are available).

So, caching can be used to improve the efficiency of distributed systems. You often see caching treated as some kind of silver bullet for speed, but it is an optimisation like any other, and should be viewed as such. The trade-off for speed is the use of more resources (and the management of those resources). More components mean more complexity, but applied carefully, caching can greatly improve the responsiveness of an application, especially a distributed one.