Scale This!: The journey of 1000 miles

Right. First things first. Before you can have any reasonable discussion about how best to design applications for the web/cloud, you have to define what the web/cloud actually is, so I'll start with the web, as that's how most things in the world now talk to each other.

Everyone knows what the web is. You type in www.google.com, and enter "lolcats". Your browser connects to Google’s servers, sends "lolcats" as a search query, and Google searches its database and sends you lots of stuff back about, well erm... "cats with unusual grammar". Simple enough. However, as you may suspect, getting a list of cats (or whatever) to a screen thousands of miles away is more complicated than it may at first appear.

To see how this works in practice, we'll start with a simple example: fetching my first blog post. The commonly understood model of the web is that you connect to a server and download the requested information. This is a useful abstraction, but not entirely accurate.

The reality is more complicated, and involves a number of layers. Each layer builds upon the one below it. Each passes messages to the next layer via progressively more abstract protocols. So, the basic process works like this...

An end user using a browser asks for a resource in the form of a web address - a URL (Uniform Resource Locator):
http://scalethis.blogspot.com/2009/05/hello-world.html

This URL specifies the protocol (http://), domain (scalethis.blogspot.com) and resource (/2009/05/hello-world.html) requested.
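As a rough illustration, the same three-way split can be reproduced with Python's standard urllib.parse module. This is only a sketch; the variable names (protocol, domain, resource) are mine, chosen to match the terms used above.

```python
from urllib.parse import urlsplit

url = "http://scalethis.blogspot.com/2009/05/hello-world.html"
parts = urlsplit(url)

protocol = parts.scheme    # the part before "://"
domain = parts.hostname    # the machine name we will later look up in DNS
resource = parts.path      # what we want from that machine

print(protocol, domain, resource)
```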

We can pack this information up in an HTTP (Hypertext Transfer Protocol) request message, the first line of which looks like this (a real request also carries headers, such as a Host header naming the domain)...

GET /2009/05/hello-world.html HTTP/1.1
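For the curious, here is a minimal sketch of how that message could be assembled as raw bytes. The function name build_get_request is hypothetical, and the "Connection: close" header is my own addition to keep the example simple. Note that HTTP/1.1 expects a Host header alongside the request line, so the sketch includes one.

```python
def build_get_request(host: str, path: str) -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: verb, resource, protocol version
        f"Host: {host}",         # HTTP/1.1 requires a Host header
        "Connection: close",     # ask the server to close the connection after replying
        "",                      # the two empty strings produce the blank line
        "",                      # that marks the end of the headers
    ]
    # HTTP separates lines with carriage return + line feed.
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("scalethis.blogspot.com", "/2009/05/hello-world.html")
print(request.decode("ascii"))
```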

Your browser then needs to find a machine that is capable of dealing with this message. To do this, it uses another Internet system called DNS (Domain Name System) to translate the domain into an actual machine to send the message to. This works like a telephone directory lookup. DNS finds that the name "scalethis.blogspot.com" is associated with the actual IP (Internet Protocol) address, 209.85.227.191.

You can see how this works by using "ping" from your command line.

C:\>ping scalethis.blogspot.com
Pinging blogspot.l.google.com [209.85.227.191] with 32 bytes of data
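If you'd rather do the lookup from code, the sketch below asks the operating system's resolver the same question using Python's standard socket module. The resolve wrapper is a hypothetical helper of mine, and it only returns a single IPv4 address.

```python
import socket


def resolve(hostname: str) -> str:
    """Ask the operating system's resolver for an IPv4 address for hostname."""
    return socket.gethostbyname(hostname)


# An IP address literal is returned unchanged, so this line needs no network.
print(resolve("127.0.0.1"))
```

Calling resolve("scalethis.blogspot.com") needs a working network connection, and the address you get back today will almost certainly differ from the one shown in the ping output above.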

Now the browser knows...
what we are looking for (/2009/05/hello-world.html)
from where (209.85.227.191)
and how to ask for it (http)

Now, unlike some other networks, the internet’s big trick is that - despite appearing as if you connect to a remote machine - the IP layer at the bottom of the TCP/IP protocol suite is in reality "connectionless" (TCP builds the appearance of a connection on top of it). Instead of opening a dedicated line to the remote machine, your computer packages up your request in the form of a message and writes an address on it: "Please send this to 209.85.227.191". It then sends this on to its nearest router, which forwards it on to another router, and another... until it reaches its destination.
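To get a feel for the "addressed message" idea, here is a small sketch that sends a single datagram to an address without ever opening a connection. A few loud assumptions: it uses UDP rather than the TCP that HTTP really travels over, it stays on the loopback interface (127.0.0.1) so no router is involved, and send_datagram is a name I made up.

```python
import socket


def send_datagram(message: bytes) -> bytes:
    """Send one UDP datagram over the loopback interface and return what arrived."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # don't wait forever if the datagram is lost
        address = receiver.getsockname()
        # No connect() call: the destination address travels with the message.
        sender.sendto(message, address)
        data, _source = receiver.recvfrom(4096)
        return data
    finally:
        sender.close()
        receiver.close()


print(send_datagram(b"GET /2009/05/hello-world.html HTTP/1.1"))
```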

You can see how this works by using "tracert" from your command line:

C:\>tracert scalethis.blogspot.com

Tracing route to blogspot.l.google.com [209.85.227.191]
over a maximum of 30 hops:
1 10.0.0.1
2 195.224.48.153
3 195.224.185.40
4 62.72.142.5
5 62.72.137.9
6 62.72.139.118
7 209.85.255.175
8 66.249.95.170
9 72.14.236.191
10 209.85.243.101
11 209.85.227.191

Here you can see all the machines through which your message has passed before finally reaching 209.85.227.191, where scalethis.blogspot.com can be found.

The clever part of this is that if one of the machines in the middle is suddenly unavailable, by way of either nuclear war or coffee spillage, the previous router can simply send the message to another router and so navigate around the problem, much in the same way that your SatNav would re-route you around Birmingham at rush hour. All this business of finding the shortest path and routing around traffic blackspots is a bit of rocket-science handled by various routing protocols, but we'll save that for another day.
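The re-routing idea can be sketched with a toy network. Everything here is an assumption for illustration: the topology in LINKS is invented, and a breadth-first search over the whole map stands in for what real routers do, which is to make a hop-by-hop decision from tables built by the routing protocols.

```python
from collections import deque

# Hypothetical topology: each router lists the neighbours it can forward to.
LINKS = {
    "home": ["A"],
    "A": ["home", "B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "server"],
    "server": ["D"],
}


def find_route(links, source, destination, down=frozenset()):
    """Return the shortest list of hops from source to destination,
    skipping any router named in `down`, or None if no route exists."""
    if source in down or destination in down:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == destination:
            return path
        for neighbour in links[node]:
            if neighbour not in visited and neighbour not in down:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(find_route(LINKS, "home", "server"))              # goes via B
print(find_route(LINKS, "home", "server", down={"B"}))  # B has coffee in it: goes via C
```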

So now your message has reached 209.85.227.191! Hooray!

Now, what to do with it? Well, the server knows it's an HTTP message, which is a good thing because 209.85.227.191 is a web server, and knows how to understand the message:

GET /2009/05/hello-world.html HTTP/1.1

It can see that you're asking to "GET" /2009/05/hello-world.html. "GET" is only one of a number of HTTP "verbs", some of which I'll describe in my next post. For now we can package up a response in order to reply. The HTTP server knows that the resource "/2009/05/hello-world.html" is held physically on "F:\Users\Temp\Backup\PleaseDontDeleteThis\ScaleThis\2009\05\hello-world.html", loads it up, and sends it back using the same forwarding technique.
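Here is a very rough sketch of what the server does with that request line. It is not how any real web server is written: the DOCUMENTS dictionary is a hypothetical stand-in for files on disk, handle_request is my own name, and only the "GET" verb is handled.

```python
# Hypothetical in-memory "document root": resource path -> HTML content.
DOCUMENTS = {
    "/2009/05/hello-world.html": "<html><body><h1>Hello World</h1></body></html>",
}


def handle_request(request_line: str, documents=DOCUMENTS) -> bytes:
    """Turn a request line such as 'GET /x.html HTTP/1.1' into a raw HTTP response."""
    parts = request_line.split()
    if len(parts) != 3:
        return b"HTTP/1.1 400 Bad Request\r\nContent-Length: 0\r\n\r\n"
    verb, resource, _version = parts
    if verb != "GET":
        return b"HTTP/1.1 405 Method Not Allowed\r\nAllow: GET\r\nContent-Length: 0\r\n\r\n"
    if resource not in documents:
        return b"HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\n\r\n"
    body = documents[resource].encode("utf-8")
    headers = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"  # blank line separates the headers from the body
    )
    return headers.encode("ascii") + body


response = handle_request("GET /2009/05/hello-world.html HTTP/1.1")
print(response.decode("utf-8"))
```

The status line at the top of the response ("200 OK", "404 Not Found" and so on) is how the server tells your browser whether it found what you asked for.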

Your browser reads this message, which contains HTML - like text but with lots of angled brackets - and displays it to you in a nicely formatted way! Yippee!

And it does all of this within a second or two (unless you're using AOL ;-)).

I've deliberately avoided talking in any detail about the higher level languages of the web (HTML, XML, SOAP, CSS, ECMA/JavaScript etc.) and what I'd refer to as the "overweb" made up of plug-ins (Flash/Silverlight, Java Applets, RealPlayer etc.), as I'll be discussing these quite a lot in the future. So for now the key takeaways are...

The web is a massive global network which uses messages to send information between computers, using routers.
These messages are all in standard formats (protocols) so that any software or hardware that sticks to those standards can understand them.
There is a fair amount of communication required to co-ordinate delivery of these messages, so the internet can be slow compared to networks that require a direct connection, such as traditional telephone networks. However, this co-ordinated exchange means that the web is by its nature flexible, reliable, and highly resilient to change.

If you're a developer you should get an overall idea of how the protocols work. You don't necessarily need to understand the details of the SYN/ACK handshake in TCP, but you should at least know what the major protocols are, what they are for, and have an understanding of how they work, at least at a Wikipedia level. Your starting point can be found here...

http://en.wikipedia.org/wiki/Internet_Protocol_Suite

but the ones of particular concern to the functioning of the web are DNS, TCP, IP and HTTP. You should have a read up on routing protocols too, particularly the Border Gateway Protocol (BGP).

Next time, I'll take a closer look at HTTP and some basic HTML/XML, and that should give us a reasonable common frame of reference on which we can build.

Monday, 18 May 2009
