This article provides a high-level overview of the technologies on which the World Wide Web is built.
A Little Background
In the 1980s, when Sir Tim Berners-Lee was working at CERN, he was frustrated with the difficulty of finding and accessing documents:
I found it frustrating that in those days, there was different information on different computers, but you had to log on to different computers to get at it. Also, sometimes you had to learn a different program on each computer. So finding out how things worked was really difficult.1
In 1989, he submitted a proposal for a hypertext2 based system that would allow access and interlinking of documents regardless of where they were located. Up to that time, the standard practice for document organization was a centralized repository that was hierarchically organized. Sir Tim Berners-Lee proposed to do away with that:
[T]he hope would be to allow a pool of information to develop which could grow and evolve with the organisation and the projects it describes. For this to be possible, the method of storage must not place its own restraints on the information. This is why a “web” of notes with links (like references) between them is far more useful than a fixed hierarchical system.3
In 1990, he wrote the first web browser and web server and developed three fundamental technologies that form the foundation of the World Wide Web:
- Hypertext Markup Language (HTML) – this is used to create webpages.
- Uniform Resource Locator (URL) – a string of characters used to identify a resource and its location.4
- Hypertext Transfer Protocol (HTTP) – the protocol by which HTML resources are accessed on the Internet.
While Sir Tim Berners-Lee’s primary concern was a web of interconnected hypertext documents, he realized that many other kinds of resources could be interlinked. His implementation was so successful that the World Wide Web now includes:
- Documents – like this webpage
- Images – Flickr allows people to share and access photos
- Video – YouTube allows people to share and access videos
- Audio – Spotify allows people to listen to music
- Applications – Google Docs is an online word processor
The first website to go live was at CERN in 1991.
By the end of 1991, there were a total of three websites worldwide. Besides the first site at CERN, they were:
- The World Wide Web Virtual Library (also at CERN)
- Stanford Linear Accelerator Center – the first North American website.
By the end of 1992, there were a total of 10 websites worldwide.
By the end of 2017, there were over 1.3 billion websites.
The World Wide Web
The Internet is the global network of interconnected devices transferring datagrams5 adhering to the Internet Protocol (IP). Most datagrams are transported over the Internet using the Transmission Control Protocol (TCP).6
The World Wide Web (WWW) is a service that runs on top of the Internet.
The easiest way to think about the WWW is that it is the global collection of documents7 written using the Hypertext Markup Language (HTML)8, identified by a Uniform Resource Locator (URL), and transferred using the Hypertext Transfer Protocol (HTTP). All of this is done on the Internet using TCP and IP.
You might wonder why there are so many protocols to move stuff about on the Internet. Why can’t TCP or IP be used directly to send a webpage or stream videos? What is the purpose of HTTP?
It might be useful to consider:
- IP is for identifying devices on the Internet. This is similar to addresses for homes and businesses. Addresses have nothing to do with content or its transportation, even though content is transferred from one address to another. IP identifies the machines, not the resources on the machines.
- TCP is for transporting datagrams from one address to another on the Internet. This is similar to how a delivery company transports goods from one address to another. The delivery company isn’t responsible for the content or for the addresses. It is only responsible for delivering the content to the correct address.
- HTTP is about accessing the content. It has nothing to do with transporting content from one address to another. It has nothing to do with how devices are identified on the Internet. A (very) imperfect analogy might be that HTTP provides a way of “packaging” resources on the Internet so they can be transported by TCP from one IP address to another.9
HTTP is not the only application layer protocol in use on the Internet, but it is the most visible. Most people don’t notice (or even know about) other application layer protocols such as: FTP,10 SSH,11 SMTP,12 POP3,13 etc.
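To make the layering a little more concrete, here is a minimal Python sketch. It starts a throwaway web server on the local machine and then talks to it the long way: it opens a TCP connection to an IP address and port, and sends an HTTP request written out by hand as plain text. The names HelloHandler, fetch_over_raw_tcp, and demo are made up for this illustration, and the server is only a stand-in so that the example runs without touching the real Internet.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny text body."""

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_over_raw_tcp(host, port, path):
    """Send a hand-written HTTP request over TCP and return the raw reply."""
    # HTTP: the request is just structured text.
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    # IP identifies the machine (host); TCP carries the bytes to it.
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def demo():
    """Run a local server on a free port and fetch "/" from it."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_over_raw_tcp("127.0.0.1", port, "/")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo().decode("ascii"))
```

The socket code knows nothing about webpages, and the request text knows nothing about how the bytes travel. That separation is the reason for having several protocols instead of one.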
Hypertext Markup Language (HTML)
HTML documents are sent by web servers and rendered by web clients (web browsers) to display the contents of a webpage to a user.
Let’s examine the following simple webpage:
All HTML documents are composed of HTML tags which (usually) come in pairs. There is an opening tag, like <html>, and its corresponding closing tag, </html>. These tags mark up the document.
HTML tags perform three different functions:
- Structural Markup tags are used to indicate the structure and purpose of the text in the document. We can see that this page is divided into two parts: a <head> that contains metainformation15 about the page and a <body> that contains the content displayed to the user. In the body, we see there is a heading (<h1>) and a paragraph (<p>).
- Presentational Markup is used to indicate how text should be displayed to the user – for example, bold, italic, strikethrough, etc. There is no presentational markup in this simple webpage. CSS is recommended for presentation.
- Hypertext Markup is used to create links inside the document to other documents or resources. In this page, there is a single hypertext link – the anchor tag <a>.
In general, HTML documents are plain text documents with special annotations (HTML tags) to describe the structure of the document. Web clients (web browsers) display the content to the user.
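Because HTML is plain text with tags, a program can pick the markup apart quite easily. The Python sketch below uses the standard library's html.parser module to list the tags, the link targets, and the visible text of a page. The PAGE string is an assumed stand-in that uses only the tags discussed above (it is not the article's own sample page), and OutlineParser and outline are names invented for this example.

```python
from html.parser import HTMLParser

# Assumed stand-in page: it uses only the tags discussed above.
PAGE = """<html>
<head>
<title>Hello World</title>
</head>
<body>
<h1>Hello World!</h1>
<p>This is a paragraph with a <a href="https://www.example.com/">link</a>.</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Collects opening tags, link targets, and text outside <head>."""

    def __init__(self):
        super().__init__()
        self.tags = []    # structural view: every opening tag, in order
        self.links = []   # hypertext view: href of every <a> tag
        self.text = []    # content view: text found outside <head>
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "head":
            self._in_head = True
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_endtag(self, tag):
        if tag == "head":
            self._in_head = False

    def handle_data(self, data):
        # Skip metainformation in <head> and whitespace between tags.
        if not self._in_head and data.strip():
            self.text.append(data.strip())


def outline(page):
    """Return (tags, links, text) for an HTML string."""
    parser = OutlineParser()
    parser.feed(page)
    parser.close()
    return parser.tags, parser.links, parser.text


if __name__ == "__main__":
    tags, links, text = outline(PAGE)
    print("tags: ", tags)
    print("links:", links)
    print("text: ", text)
```

A web browser does far more than this, of course, but the first step is the same: read the text, recognize the tags, and decide what each one means.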
Uniform Resource Locator (URL)
A Uniform Resource Locator (URL) is used to access resources on the WWW. It has the following format:
<access protocol>://<host>/<location & resource name>
- access protocol specifies how the resource is to be accessed. For the WWW it is either HTTP or HTTPS16. There are many different access protocols for the Internet.17 Technically, you should always include the access protocol when you type the URL for a website. However, browsers (being helpful) will automatically prefix the protocol for you.
- host specifies which device on the Internet contains the resource. It can be an IP address (like 127.0.0.1) or, more commonly, a human readable string (like www.complete-concrete-concise.com) which is translated by the Domain Name System (DNS) into an IP address.
- location & resource name specifies the name of the resource and where it is located. If no resource name is given, web servers typically return a default document, commonly index.html.
Let’s consider the following URL:
https://complete-concrete-concise.com/sample/helloworld.html
- The access protocol is https.
- The host is complete-concrete-concise.com.
- The location & resource is /sample/helloworld.html.
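The same breakdown can be done in code. This short Python sketch uses urlsplit from the standard library's urllib.parse module; the URL string is assembled from the three parts listed above, and the function name split_url and the dictionary keys are just labels chosen to match this article's terminology.

```python
from urllib.parse import urlsplit

# Assembled from the three parts listed above.
URL = "https://complete-concrete-concise.com/sample/helloworld.html"


def split_url(url):
    """Break a URL into the three parts described in this section."""
    parts = urlsplit(url)
    return {
        "access protocol": parts.scheme,
        "host": parts.hostname,
        "location & resource": parts.path,
    }


if __name__ == "__main__":
    for name, value in split_url(URL).items():
        print(f"{name}: {value}")
```

Note that urlsplit calls the access protocol the "scheme", which is the term used in the URL specifications.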
Hypertext Transfer Protocol (HTTP)
HTTP is a flexible request-response protocol for data transfer on the World Wide Web. It is most commonly used for transferring hypertext documents, like HTML, but can be used to transfer other types of content.
It is based on a client-server model. The client sends a request to a server. The server then responds to the client. The classic example of this is a web browser (client) that requests a webpage from a website (server); the website (server) responds by sending the webpage to the web browser (client).18
The World Wide Web consists of web clients19 and web servers20 that interact in the following way:
- A web client sends an HTTP request to a web server.
- The web server receives the HTTP request.
- The web server processes the request.
- The web server sends an HTTP response to the web client.
- The web client receives the HTTP response.
- The web client processes the HTTP response (processing usually involves preparing the response for display to the user of the web client).
HTTP is a stateless protocol.21 This means that each HTTP request is independent of all other HTTP requests. In other words, the current request knows nothing about previous requests: all information needed to fulfill the request must be contained in the request itself.
There are two HTTP message types: requests and responses. Clients always send requests and servers always reply with responses.
Two common HTTP request methods are GET and POST:
- GET is used to request a resource from a server – for example, a webpage.
- POST is used to send data to a server – for example, submitting a comment or a form.
Two common HTTP response status codes are 200 (OK) and 404 (NOT FOUND):
- 200 (OK) means the request was accepted and processed correctly. There is never any reason to display this response to the user.
- 404 (NOT FOUND) is returned by the server when the requested resource cannot be found. It means the server understood the request, but was not able to find the requested resource.
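The request methods and status codes above can be tried out with a small Python sketch built on the standard library's http.server and http.client modules. Everything specific here is an assumption made for the illustration: the PAGES dictionary stands in for a website, the /comments path and the COMMENTS list stand in for a comment form, and the choice to answer a POST with 200 is just the simplest option.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory "website": one page at one path.
PAGES = {"/sample/helloworld.html": b"<h1>Hello World!</h1>"}
COMMENTS = []  # data received through POST ends up here


class DemoHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET: return the resource if it exists, otherwise 404.
        if self.path in PAGES:
            self._reply(200, PAGES[self.path])
        else:
            self._reply(404, b"not found")

    def do_POST(self):
        # POST: read the data sent by the client and store it.
        length = int(self.headers.get("Content-Length", 0))
        COMMENTS.append(self.rfile.read(length))
        self._reply(200, b"thanks")

    def log_message(self, format, *args):
        pass  # keep the example quiet


def request(port, method, path, body=None):
    """Send one request on its own connection; return (status, reason, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(method, path, body=body)
        response = conn.getresponse()
        return response.status, response.reason, response.read()
    finally:
        conn.close()


def demo():
    """Run a local server on a free port and send it three requests."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return [
            request(port, "GET", "/sample/helloworld.html"),
            request(port, "GET", "/missing.html"),
            request(port, "POST", "/comments", body=b"Nice article"),
        ]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for status, reason, body in demo():
        print(status, reason, body)
```

The first request comes back with status 200, the second with 404, and the third delivers its data to the server. Each request is complete in itself: the server does not need to remember the earlier ones in order to answer the next.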
Summary
- The World Wide Web is one of many services that run on top of the Internet.
- The World Wide Web is the global collection of resources written using Hypertext Markup Language (HTML),22 identified using a Uniform Resource Locator (URL), and transferred using the Hypertext Transfer Protocol (HTTP).
- Clients request resources from servers using HTTP.
- Servers respond to clients using HTTP.
- HTTP is transported using TCP between devices adhering to the Internet Protocol.
- The flexibility of HTTP has contributed to the World Wide Web becoming extremely popular because it handles all types of content: from documents to videos, from images to applications.
If you are interested in learning more, check out the following resources:
World Wide Web
Uniform Resource Locators
Hypertext Transfer Protocol
- RFC 2616 – this is an older document specifying HTTP/1.1. It is a good place to start because everything is in one place.
The following, more recent, documents update RFC 2616 and split it into multiple parts. They provide more detail, but may be information overload:
- RFC 7230 Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing
- RFC 7231 Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content
- RFC 7232 Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests
- RFC 7233 Hypertext Transfer Protocol (HTTP/1.1): Range Requests
- RFC 7234 Hypertext Transfer Protocol (HTTP/1.1): Caching
- RFC 7235 Hypertext Transfer Protocol (HTTP/1.1): Authentication
1. From Answers For Young People
2. Hypertext is a type of document that contains links (called hyperlinks) to other documents. These links can be clicked to access the other documents.
3. From Information Management: A Proposal
4. Strictly speaking, this is untrue. Berners-Lee developed the Uniform Resource Identifier (URI). There are many kinds of URIs; two common ones are the Uniform Resource Name (URN) and the Uniform Resource Locator (URL). A URN only identifies a resource by name. For example, “the book A Tale of Two Cities by Charles Dickens” is like a URN. A URL, on the other hand, identifies both the resource and its location. For example, “the book A Tale of Two Cities by Charles Dickens can be found at the library on the third bookshelf, second shelf from the bottom, fifth book from the left” is like a URL because it specifies both the resource and its location. Therefore, all URLs are URIs, but not all URIs are URLs (because some URIs are URNs).
5. A datagram is a packet of information. The IP datagram consists of a header and a payload.
6. There are other protocols, but TCP’s robustness and reliability make it very popular.
7. Strictly speaking, this is untrue. The official definition of the World Wide Web is “an information space in which the items of interest, referred to as resources, are identified by global identifiers called Uniform Resource Identifiers (URI)”. The official definition is vague on what a “resource” is: “We do not limit the scope of what might be a resource.” However, practically speaking, most, if not all, resources are HTML webpages or things that act as if they were HTML pages.
8. It does not have to be HTML. It could be XHTML, XML, SVG, or any other HTML-like markup language.
9. Technically, HTTP is an application layer protocol and interface. This concept doesn’t translate well into the real, physical world we inhabit. If you give someone a drink of water, it has to be in an interface that is useful – like a glass. The glass is HTTP, giving the person the water in the glass is TCP, and the receiver is IP. Consider shipping marbles to someone: there is an address (IP), a courier company does the delivery (TCP), but, in order to ship the marbles, you have to adhere to some conventions (protocols) and package the marbles in a box (interface) (HTTP).
10. File Transfer Protocol is used for transferring computer files.
11. Secure Shell provides secure access to a computer over a network.
12. Simple Mail Transfer Protocol is used for transmitting emails.
13. Post Office Protocol is used to retrieve email from a remote server.
14. These will all be covered in greater detail in future tutorials.
15. Metainformation is information about the document, not the information contained in the document. For example, metainformation about a document could include its length, language, date of publication, date of last revision, author, etc.
16. HTTPS is the secure form of HTTP. Using HTTPS ensures all communication between the client and server is encrypted. Surprisingly, HTTP and HTTPS access can both occur on the same page. This occurs because HTML documents often contain hyperlinks to other documents. Those hyperlinks point to documents that are transmitted using either HTTP or HTTPS.
17. A list of official and unofficial protocols can be found here
18. Actually, the server can respond in a number of different ways. For example, if the webpage is not found, it can respond with a 404 error.
19. For example: Mozilla Firefox, Google Chrome, Apple Safari, Microsoft Edge and many others.
20. For example: Apache2, nginx, and many others.
21. This is true of the HTTP protocol itself. Servers tend to log all or part of incoming HTTP requests. These logs are for maintenance and forensics, not for keeping track of previous HTTP requests. Clients also tend to cache components of the HTTP response – for example: cookies, images, HTML pages, and other resources. These topics will be addressed in future tutorials.
22. This is mostly true. The documents don’t have to be written in HTML, but they do have to be something that – more or less – behaves as if it were HTML – for example, XHTML.