Monday, 24 August 2009

Unleash the beasts!

Following my last post on caching patterns, I was asked a question about how to scale cache repopulation. The question was "In a read through caching scenario... assuming your cached object takes 30 seconds to calculate, and you have 60 processes accessing the cached value every half a second each. This suddenly expires, resulting in 120 misses per second until the value is re-calculated and re-cached successfully, swamping your server with repopulation requests. Right?" Well, erm, yes. Yes it does. Within some of the more nerdy circles, this is often called the Dog Pile effect, and has brought many websites down.

When caching quick-to-create items, this rarely causes problems. However in the case of large data sets that take a while to calculate, this can prove "the straw that breaks the camel’s back".

From Wiktionary: v. to dogpile

  1. To jump on top of someone, usually in a group.
    2003, Nancy Holder, Buffy the Vampire Slayer: Chosen[3], ISBN 0743487923, page 657:
    A vampire got her around the neck from behind; then more, dogpiling her.
  2. To pile on or overwhelm, such as with criticism or praise.
    2005, Craig Spector, Underground[4], ISBN 0765306603, page 169:
    But this guy was serious, using online payment services and dogpiling her e-mail box within minutes, requesting expedited shipping.

You shouldn't really have anything in your system taking so long to repopulate. If you do, it's likely to indicate a more fundamental architectural problem, as whatever you’re caching does not have a high degree of disposability.

This is a situation where we have hundreds of workers looking for the same information, so we don’t want to swamp our data source with the requests for that same information.

What we want to do, in theory at least, is use a lock. OMG. I know. I said never use locks, but here we are using them. What gives?

Well, this is really an optimisation for a particularly unusual case. Here, cached items are immutable, so we have no side effects to worry about. The process for repopulating lists is just not very efficient. Locks CAN be used in a distributed system, but just very carefully (and normally on a singly responsible background process). For example, some data reconciliation processes may require locks on data, but by putting these processes on background workers, we try to ensure they do not block user activities.

You also need to be aware of the problems of using locks in your particular environment. (We're OK here as our write operations are idempotent).

In this case we would write a wrapper around the cache, to check on its expiry, and force only a single thread to repopulate the cache.

So in pseudo code...

function GET(key)
{
object = Cache.Get(key) //try and get the value from the cache
if (object is null)
{
lock //SERIALISE ACCESS
{
if (object is still null)
{
object = GetItemToCache() //takes 30 seconds
PUT(key, object, now + 15 minutes) //expires in 15 minutes time.
}
}
}
return object
}

function PUT(key, value, expires)
{
Cache.Insert(key, value, expires)
}
Now you still have processes queuing up at the point where access is serialised, so in the 30 seconds required to repopulate the cache, you have 120 processes per second blocking. This solves one problem but creates another.
We've offloaded the traffic from GetItemToCache() , but forced everyone to wait. So better, but still not too great. How can we improve this to avoid blocking even further? Well... we can temporarily re-cache the "expired" value!

Here, we know that our cached object takes (about) 30 seconds to repopulate, so we can estimate when the item will go "stale". When the data goes stale it expires. This allows users to continue to read stale data whilst the update is prepared. avoiding contention on the object loading process.

.NET's cache for example allows you to specify a callback, which runs when an item falls out the cache. Most caching platforms have a similar mechanism for handling dropped cache items.

Using ASP.NET expiry, callback functions are delegates that support the following signature:

void CacheItemRemoved(string key, object val, CacheItemRemovedReason reason);

So our newly refactored pseudocode is...

constant ItemKey = "Item_That_Goes_Stale"; //a key to identify the cached item
function GET()
{
object = Cache.Get(ItemKey) //try to get the item from the cache.
if (object is null)
{
lock //if we have a miss, we synchronise access to repopulate the cache here.
object = Cache.Get(key)
if (object is still null) //double check after a lock (another thread could have jumped in.)
{
if (object is null) POPULATE_CACHE(); //If it's not in the cache.. cache it.
return object
}
end lock
}
return object
}

function PUT(value, expires)
{
//Here we add a callback to retain the value in the cache in the case of expiry.
//ASP.NET will call MAKE_STALE just before the item is dropped from the cache.
Cache.Insert(ItemKey, value, expires - 30 seconds, MARK_STALE);
}

function MARK_AS_STALE(string key, object val, CacheItemRemovedReason reason)
{
if (reason for removal is Removed or Underused) then return;
// do no re-cache the item if it's removed (written over) or underused.
//cache is stale... repopulate it straight away... Now if someone calls GET in the next 30 seconds they will get the old value.
Cache.Insert(ItemKey, value); //this value will be removed when the following POPULATE_CACHE call replaces it.
POPULATE_CACHE(); //REPOPULATE THE CACHE
}

function POPULATE_CACHE()
{
object = GetItemToCache() //takes 30 seconds
PUT(key, object, now + 15 minutes) //expires in 15 minutes time.
}

If you are not using a platform which supports this type of callback pattern, you can use the cache itself to check for staleness by caching a "still fresh" flag for an object which expires shortly before the real item. You can then check the cache for both the real item and the "still fresh" item. If the "still fresh" flag is not in the cache, we know the real item is stale. This does require a double read of the cache.

Either way, your application should remain much more responsive and stable if this technique is employed to freshen the cache ahead of time. Again, I must say this is an optimisation that is hopefully not needed too often. If you’ve designed your systems around distributed architectural principles, you should never need to do this. The vast majority of the time, the standard read/write, through/behind caching techniques should be more than enough to achieve acceptable system responsiveness.

The central principle and widespread use of caching in large scale web applications, and the new post-relational NoSql databases (used in the likes of Amazons Web service platform and Microsoft Azure Queues), lead to some interesting and common problems with data consistency which I will start to look at in my next post.

No comments:

Post a Comment