
Tuesday, 6 April 2010

The Byzantine Cake Problem

It's Derek's birthday! As is normal for these events, we "secretly" have a collection, and the person who collects the money goes and "secretly" buys a cake for the birthday boy.

That was until my boss, Craig, sent this out to the office by email...

"Can someone go to shop please as I’m busy all afternoon."

This results in an interesting logical problem.

There is an implied condition in the phraseology he's used: the person who goes is therefore not busy all afternoon. If they're not busy and he doesn't know about it, they're admitting to being covertly lazy!

Aside from this immediate conflict of interest, all of the listeners who received the message (delivery of which, by the way, is not guaranteed) have no way to discuss in public who is going to go to the shop, as the target acts as a listener too and must not be aware of any discussion.

Now every time someone leaves the room, there is no way to tell if they have gone to get the cake. Craig is already gone as he is apparently "busy", but so is the money. No-one can know for sure whether he, or another unknown "shopper", has taken it.

(Incidentally, Derek, whose birthday it is, has just left, so potentially he's stolen the money and is away to buy a cake, or a new car, for himself. A potential security problem here too.)

In the end, the money acts as a flag, but without some form of centralised co-ordination everyone in the office can go as far as the point of contention: all standing in the queue at Sainsbury's with cakes and no money, either waiting eternally for Craig to turn up, or hitting a race condition if he does, resulting in abandoned cake purchases and dead bodies all over the place.

The question from a protocol design point of view (given the Two Generals problem and its proof of impossibility) is... "Has Craig mitigated the uncertainty inherent in the communication channel to an acceptable degree?"
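For fun, the office's predicament can be sketched as a toy simulation (entirely my own illustration, with made-up numbers): with no channel for public discussion, each listener can only decide independently whether to go, and the outcomes are exactly what you'd fear.

```python
import random

def cake_run(listeners, p_go, trials, seed=42):
    # Each listener decides independently whether to go to the shop;
    # with no side channel, no round of discussion is possible.
    rng = random.Random(seed)
    outcomes = {"no cake": 0, "one cake": 0, "queue full of cakes": 0}
    for _ in range(trials):
        shoppers = sum(rng.random() < p_go for _ in range(listeners))
        if shoppers == 0:
            outcomes["no cake"] += 1
        elif shoppers == 1:
            outcomes["one cake"] += 1
        else:
            outcomes["queue full of cakes"] += 1
    return outcomes

print(cake_run(listeners=8, p_go=1 / 8, trials=10_000))
```

Even when each of eight listeners goes with probability 1/8 (the "fair" choice), exactly one shopper turns up well under half the time; the rest of the runs end with no cake at all, or a Sainsbury's queue standoff.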

Update: Craig has returned to the office, put on his coat and left.

Update 2: He's back! With a cake!

Monday, 13 July 2009

"It Worked."

In my last post I briefly discussed statelessness. The key point about statelessness is that between HTTP calls, whether to fetch web pages, or to access web services or other HTTP-based services, each part of the system should be capable of being restarted without causing the others to fail.

Ideally, services should require no state. There are ways of maintaining a "session state" in many web platforms, but these should very much be seen as an optimisation or workaround, rather than a first port of call. I will look at the various session state patterns in a future post.

One side effect of statelessness is that we do not really have the context of a conversation available to us, so all the state required to complete a transaction must be available during a single request/response exchange and have no dependencies outside this. This characteristic is referred to as "Atomicity", and it normally produces a particularly interesting expression on the face of a developer when realised for the first time. Usually it's the result of attempting to use a component that has adopted what I would call the "Barefaced Liar" pattern - all seemed fine when the code ran locally.

This is where a component is designed to expose a simple interface, common to UI development, but actually impossible to guarantee in a distributed environment. There are lots of examples of this but often it's a Distributed Transaction Coordinator of some sort - although session state providers, and WCF’s "Bi-Directional" channel, can cause similar nauseating effects.

So... we have no state available here in which to hold stateful conversations.

The mother of all conversations, in software circles at least, is the "2-phase commit". This protocol ensures that when I ask for work to be done by, say, three co-operating participants, either all of it is done or none of it is. The canonical example is the bank transaction. We want money to be taken out of one account and put into another, but if one of those actions fails, we don't want to revert just that one action and pretend nothing untoward has happened. We don't want both (or neither!) of the accounts to end up with the money.

It essentially works like this. A coordinator asks each participant to do work and waits until it has received a confirmation reply from all participants.

The participants do their work up to the point where they will be asked to commit to it. Each remembers how to undo what they've done so far. They then reply with an agreement to commit their work if the work was done, or a "failed" message if they could not perform the work.

If the coordinator received an agreement message from all participants, then it sends a commit message to them all, asking them to commit their work. On the other hand, if the coordinator received a fail message from any of the participants, then it sends a "rollback" message to all, asking them to undo their work.

Then, it waits again for responses back from all participants, confirming they've committed or rolled back. Only when all have confirmed successful committal, does the co-ordinator inform whoever is interested, that the transaction has succeeded.
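The steps above can be sketched in a few lines of Python. This is a minimal sketch with toy in-memory participants, not any real transaction coordinator:

```python
class Participant:
    def __init__(self, name, can_do_work=True):
        self.name = name
        self.can_do_work = can_do_work
        self.state = "idle"

    def prepare(self):
        # Phase 1: do the work, remember how to undo it, but don't commit yet.
        if self.can_do_work:
            self.state = "prepared"
            return "agree"
        self.state = "failed"
        return "fail"

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"

def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if everyone agreed; otherwise roll everyone back.
    if all(v == "agree" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        if p.state == "prepared":
            p.rollback()
    return "rolled back"
```

If one of three participants fails to prepare, the other two roll back and the whole transaction fails as a single unit.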

This is a staple pattern of Client Server Systems, particularly of the RDBMS kind. However, it is not guaranteed to work in a stateless system as conversations require context and so are not stateless. This breaks our previous advice about maintaining disposability. You cannot invisibly dispose of a participant in a transaction who is waiting to commit with no side effects.

So... Operations exposed on the web should all be Atomic. They should fail or pass as a single unit.

To a client/server developer this of course seems at first sight to be monumentally impractical in many real world scenarios. But in fact most activities of any real-world value cannot use transactions as we know them.

In a standard web services integration, the BookFlight(), BookHotel() and BookCar() methods in a BookHoliday() operation may well be run by completely different organisations. Each of these organisations is likely to have differing policies and data platforms.

I've always found it funny that the poster boy for the 2-phase commit, The Bank Account Transfer, in reality cannot - and does not - use the 2-phase commit protocol.

So what does it use? Well the answer lies in a fairly pungent concoction of asynchronous co-ordination, reconciliation and compensation that is definitely an acquired taste. I will come back to look at specific ingredients you can use in a later post but for now, the important thing to remember is...

At the service boundary, operations should fail or pass in their entirety.
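As a small taste of that concoction, here is a sketch of one ingredient, compensation: each step that succeeds registers an undo action, and a later failure triggers the undos in reverse order. The booking functions are hypothetical stand-ins for the holiday example above, not any real API.

```python
def run_with_compensation(steps):
    """steps: a list of (action, compensation) pairs."""
    done = []
    for action, compensation in steps:
        try:
            action()
            done.append(compensation)
        except Exception:
            # A step failed: compensate the completed steps, newest first.
            for undo in reversed(done):
                undo()
            return "compensated"
    return "completed"

log = []

def book_flight(): log.append("flight booked")
def cancel_flight(): log.append("flight cancelled")
def book_hotel(): log.append("hotel booked")
def cancel_hotel(): log.append("hotel cancelled")
def book_car(): raise RuntimeError("no cars left")
def cancel_car(): log.append("car cancelled")

result = run_with_compensation([
    (book_flight, cancel_flight),
    (book_hotel, cancel_hotel),
    (book_car, cancel_car),
])
```

Note that each step commits locally and atomically; there is no global lock held across organisations, just an agreement to undo work if the overall activity cannot complete.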

Next up: Concurrency

Thursday, 2 July 2009

The state I am in

One of the fallacies of distributed computing states that "Network topology doesn't change."

Well, yes it does, particularly on the internet. Pieces of hardware and software are constantly being added, upgraded and removed. The network is in a constant state of flux. So how do you perform reliable operations in such a volatile environment?

Well, the answer lies in part with the idea of statelessness. HTTP is what’s known as a stateless protocol. For those coming from a non-distributed platform background, statelessness is one of the most misunderstood aspects of web development.

Partly this is due to attempts to hide statelessness by bolting on some - often clunky - state management system. This is because statelessness is often viewed as a kind of deficiency. Framework developers all over the world say, "Oh yuk. Statelessness. This is soOOo difficult to work with. I want my developers to be able to put a button on a form and say OnClick => SayHello(). It's 2009! I don't want to have to worry about state issues on a button click!".

If you're dealing with a Windows Forms style of application, your button is more or less the same "object" that is shown on the user's screen; but in many web frameworks, a "button" and its on-screen representation in a user's browser may be thousands of miles apart. The abstraction of "a button" assumes a single object, with a state - but this is not the reality. What appears to be one button is in fact two buttons: a logical one on a server, and a UI representation in a browser, with a "click" message being sent between them. The concept of "a button" is a leaky abstraction, and the reason it is leaky is that it does not take into account the network volatility that is a standard feature of the web.

So, what do I mean by statelessness?

When you use the web, it enlists the resources required to fulfil your request dynamically. Once the request has been completed, the various physical components that took part are disconnected, and neither know nor care anything for each other anymore. How sad :-(

Let's look at it in terms of a conversation,

Customer: Hello
Teller: Hello, what can I do for you today?
Customer: I'd like to know my bank balance please.
Teller: Can I have your account number please?
Customer: It's 09-F9-11-02-9D-74-E3-5B-D8-41-56-C5-63-56-88-C0 ;-)
Teller: Well. Your balance is only $1,000,000.
Customer: Ok, Can I please transfer all of it to my Swiss bank account please?
Teller: Yeah sure, what's the account number?
Customer: 12345678.
Teller: Ok, I'll just do that now for you. Would you like a receipt?
Customer: Yes please.
Teller: Here you go => Receipt Provided.
Customer: Bye.
Teller: Bye.

This conversation contains state. Each role of Customer and Teller must know the context of the messages being transferred.

The stateless version of this would go more like...

Customer: Hello, I'd like to know the Balance for account 09-F9-11-02-9D-74-E3-5B-D8-41-56-C5-63-56-88-C0,
Teller: It's $1,000,000

Customer: Could you please transfer $1,000,000 from account 09-F9-11-02-9D-74-E3-5B-D8-41-56-C5-63-56-88-C0 to Swiss bank account number 12345678 and provide a receipt please.
Teller: Yup. Here you go => Receipt Provided.

The end results are the same. However in the first example, "fine-grained operations" are used. The Customer and the Teller need to know where they are in the conversation to understand the context of each statement. In the second, each individual part of the conversation contains all of the information required to perform the requested task.

If we think of these as client and server message exchanges, then in the second example it doesn't matter if the server (teller) is restarted, reinstalled or is blown-up and replaced between the two requests. This is invisible to the end user. You would never be aware of the swap.

However, in the first example, if the server was restarted it would "forget" where you were in the conversation.

Customer: Yes please.
Teller: Sorry? What?

...or in pseudocode...

Client: Yes please.
Server: Sorry? What? WTF? Ka-Boom!

This is a fundamental difference between programming in a “stateful” way, and distributed programming.
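The two tellers can be sketched in code. This is a toy model of my own; the BALANCES dictionary and the shortened account number are stand-ins for a real data store:

```python
BALANCES = {"09-F9-11-02": 1_000_000}  # stand-in data store, shortened account id

# Fine-grained and stateful: the teller remembers where you are in the chat.
class StatefulTeller:
    def __init__(self):
        self.account = None  # conversation state lives in the server's memory

    def give_account_number(self, account):
        self.account = account

    def balance(self):
        if self.account is None:
            raise RuntimeError("Sorry? What?")  # restarted mid-conversation
        return BALANCES[self.account]

# Coarse-grained and stateless: every request carries its full context.
def get_balance(account):
    return BALANCES[account]

teller = StatefulTeller()
teller.give_account_number("09-F9-11-02")
teller = StatefulTeller()  # the server "restarts" between exchanges...
# teller.balance() now raises "Sorry? What?", while the stateless call
# get_balance("09-F9-11-02") still works, because it needs no memory.
```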

Understanding statelessness, its implications and how to manage state in a stateless environment, is a core skill of a web developer. Without understanding the implications of various state management systems, you simply cannot develop large-scale applications. In a web application there is ALWAYS a client, and a server, and a potentially lengthy request/response message exchange between them.

Most programmers have a fairly firm grip on Object-Oriented programming, and one of the keys to this is the packaging of state and behaviour. As a result, they get locked in to a system of thinking that says state and behaviour should always go together, hence the popularity of technologies such as CORBA, DCOM, and Remoting. These technologies make it easy for people who are used to dealing with "objects" to deal with these without worrying directly about the messaging process that goes on behind the scenes.

However, you often hear complaints about the scalability of these types of remote object technologies, and the reason often involves assumptions about state. Remoting interfaces are frequently generated automatically by whatever remoting framework you are using, and the objects behind them are very often designed deliberately to be re-usable (although not necessarily performant) in both distributed and non-distributed environments. The result is chatty, context-bound, fine-grained interfaces, which have a hard time keeping up when a large number of conversations are going on at once.

Ceteris paribus, if you imagine the two conversations above in a remote system where each message exchange took a second, the first would take 7 seconds, the second 2.

What's also interesting is that the two message exchanges in the second conversation can be made either way round or, if required, in parallel. This of course works only if you are able to make some assumptions regarding the availability of cash to you, for example, that you have an overdraft facility of $1,000,000 which you are currently not using. I will come back to this when I discuss Concurrency and Data Consistency, but it is perfectly reasonable to run these operations in parallel. This, along with the rapid growth in multi-core and cloud computing, is one reason for the recent resurgence of interest in functional/stateless programming languages such as Erlang and F#.
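Because each stateless exchange is self-contained, nothing stops a client issuing them concurrently. A rough sketch, simulating each round trip with a short sleep:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def round_trip(name, latency=0.2):
    time.sleep(latency)  # stand-in for a ~200 ms network round trip
    return f"{name}: done"

start = time.monotonic()
with ThreadPoolExecutor() as pool:
    # Both self-contained requests go out at once.
    results = list(pool.map(round_trip, ["get balance", "transfer"]))
parallel_time = time.monotonic() - start
# Sequentially these two calls would take ~0.4 s; in parallel, ~0.2 s.
```

The stateful conversation cannot be parallelised this way, because each message only makes sense after the previous reply.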

The important thing to remember about statelessness is that it is all about maintaining a high degree of disposability. You should be able to throw everything away between message exchanges without a user noticing a server restart. Doing this will enable your application to handle server restarts and changes in network topology gracefully.

Next up: Atomicity.

Monday, 29 June 2009

If you're not confused, you're not paying attention

In the last post we looked at the myths of zero latency and infinite bandwidth. I'd like quickly to cover some of the other fallacies before coming back to these, and how they result in some fundamental architectural constraints for distributed systems.

So let’s quickly check off a couple of these fallacies and their solutions.

The network is secure - It's not. The internet is a public network. Anything sent over it is, by default, equivalent to writing postcards. You wouldn't write your bank account details on a postcard. Don't send them openly over the web.

Solution: Secure Sockets Layer (SSL), the famous “padlock” icon in your browser. Now my postcard is locked in a postcard sized safe. It can still go missing, but no-one can open it without the VERY long secret combination.

There is one administrator. - There is not. There are lots of people, all speaking different languages, with different skills, running different platforms with entirely different needs, requirements and implementations. The people deploying code to the web are often entirely separate from those responsible for keeping that code running; frequently these groups have nothing to do with each other.

Solution: When designing software for this type of environment, you should make absolutely no assumptions about access to or availability of particular resources. Always write code using a "lowest common denominator" approach. Try to ensure that when your system must fail or degrade, it does so gracefully. I will come back to this in a future post.

These problems resulting from the heterogeneity of distributed systems are mostly solved or worked around by the adoption of the various web standards. The point of the standards is that messages can be exchanged without the requirement for a specific platform, or all the baggage that goes along with that.

Right... take a deep breath. Some of the jargon used in the next couple of paragraphs may cause eye/brain-strain, but do not fear! All will become clear, hopefully in 6 posts’ time. The next 3 posts will cover what I would regard as the most important considerations in building a large web based system, and are directly related to the remaining fallacies.

Fallacy → Consideration/Constraint

Network topology doesn't change. → Statelessness
Transport cost is zero. → Atomic/Transactionless Interfaces
The network is reliable. → Idempotence (yes, really!)


Following that, I’ll discuss some of characteristics and issues that emerge from systems built within these constraints. Specifically, Concurrency, Horizontal Scalability and Distributed Data Consistency (or inconsistency, as more accurately describes the issue).


So next time... Statelessness.

Wednesday, 17 June 2009

Faster than a speeding bullet

Over the next couple of posts I'm intending to go into just a little more depth about the fallacies of network computing I mentioned last time. During the past week I have been doing some work using Amazon's Web Services Platform to speed up one of our websites.

While demonstrating the delivery speed gained by using Amazon’s CloudFront web service, I encountered a surprising (to me at least) reaction. Though impressed by the visible improvement, many non-web developers were raising points such as, "Why does where our site is hosted matter? I thought the web is globally available", "I thought we were paying for unlimited bandwidth with our ISP?"...

Well, this reaction fell straight into the path of fallacies number 2 and 3!

Latency is zero.
Bandwidth is infinite.

WRONG!!!

Many people assume that bandwidth and latency are the same thing. They are not. The common metaphor is a system of roads: the number of lanes on the road is equivalent to its bandwidth, and its length is equivalent to its latency.

A high-bandwidth connection usually has lower latency, since it has less chance to get congested at peak times. However, latency and bandwidth are NOT directly related. They normally go together (i.e. high bandwidth usually means better ping times), but they don't have to.

Even if we could buy "unlimited" bandwidth from our supplier, if our website is served up from a single location then, no matter how fast the connection, it will still suffer from the effects of network latency/lag when viewed at a global level.

It's all to do with the speed of light. No, really! It's Physics time. Hooray!

With the exception of some newfangled weird physics theories, as far as normal physics goes, the speed of light is constant. Nothing can travel faster than it, and this "nothing" rule therefore applies to the messages you send around the internet.

Light goes very "fastly" (as my 2-year-old son says) - about 300,000 km per second. This is roughly a million times faster than sound, and fast enough to zoom around the Earth more than 7 times in one second. This sounds really quick, and for day-to-day things like phoning someone (or using smoke signals) you wouldn't perceive it. However, over distances of thousands of miles, the assumption that you never need to worry about the speed starts to break down.

You can sometimes see this effect if you have satellite TV and the same programme is being broadcast on both terrestrial and satellite channels. If you flick between the channels you'll see the satellite picture arrives a moment later than the one broadcast from the ground.

Broadcasting uses "unreliable" protocols, so if messages are held up, they are assumed missing in action and your signal/picture will break up. TCP/IP however is a "reliable" protocol, so if messages are held up, you have to wait for them to arrive, and your experience is that your signal will appear to slow down.

If you're a computer trying to send a message to the other side of the world, it takes about 100 milliseconds to "ping" a computer in Australia from here in the UK (assuming your message travels at the speed of light).
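Back-of-the-envelope, assuming a rough great-circle distance of 17,000 km from the UK to Australia (my figure, not a measured cable route):

```python
SPEED_OF_LIGHT_KM_PER_S = 300_000  # in a vacuum; light in fibre is slower
UK_TO_AUSTRALIA_KM = 17_000        # rough great-circle distance (assumption)

# A ping is a round trip, so double the one-way time.
min_rtt_ms = 2 * UK_TO_AUSTRALIA_KM / SPEED_OF_LIGHT_KM_PER_S * 1000
print(f"theoretical minimum ping: {min_rtt_ms:.0f} ms")  # → 113 ms
```

Real pings come out higher than this floor: light in optical fibre travels at roughly two-thirds of its vacuum speed, and real routes are anything but great circles.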

Try this!... Ping a well-known server in Australia:

> ping wwf.org.au

...and a well-known server here in the UK:

> ping direct.gov.uk

Depending upon where you are located, you'll probably see very different times. I'm seeing ping times of between 340 and 360 ms to Australia, and just 14 to 21 ms pinging locally (from Scotland to London).

So what does this mean to me as a developer? Well if my users are in London and my site is hosted in Australia, it's going to take a long time to load.

The most common way to mitigate this problem is to use the smallest possible message size: compress your graphics, and so on. However, for rich content like large Flash or Silverlight files, or streaming video, this is often impossible. An alternative is to put your content as close as possible, in network terms, to your end user.

Whatever your application does, there are some important design decisions to be made in terms of how to partition, deploy and manage it to ensure an optimum experience for end users.

The most common design practices to work around these 2 assumptions are...

1. Optimise your message sizes carefully.
2. Compress any large content.
3. Make a clear separation between types of content, so that your application can be partitioned either logically or geographically if necessary.

At a basic level you might separate static vs. dynamically generated content, but could also be partitioning based on expected usage, data size, or resource locale, for example.

Whatever you need to take in to account for your specific application, the one rule to rule them all is
"Never assume that the network is fast enough by default."