A Little Background
In the 1980s, Sir Tim Berners-Lee was working at CERN and found it difficult to access documents and information stored on different computers:
I found it frustrating that in those days, there was different information on different computers, but you had to log on to different computers to get at it. Also, sometimes you had to learn a different program on each computer. So finding out how things worked was really difficult.1
In 1989, he proposed a hypertext2 based system for accessing, sharing, and linking documents. At that time, the standard practice for document organization was a centralized repository. Sir Tim Berners-Lee proposed to do away with that:
[T]he hope would be to allow a pool of information to develop which could grow and evolve with the organisation and the projects it describes. For this to be possible, the method of storage must not place its own restraints on the information. This is why a “web” of notes with links (like references) between them is far more useful than a fixed hierarchical system.3
In 1990, he wrote the first web browser and web server and developed three fundamental technologies that form the foundation of the World Wide Web:
- Hypertext Markup Language (HTML) – this is used to create webpages.
- Uniform Resource Locator (URL) – a string of characters used to identify a resource and its location.4
- Hypertext Transfer Protocol (HTTP) – the protocol by which HTML resources are accessed on the Internet.
While Sir Tim Berners-Lee’s primary concern was for a web of interconnected hypertext documents, he realized that many other resources could be interlinked. His implementation was so successful that the World Wide Web now includes:
- Documents – like this webpage
- Images – Flickr allows people to share and access photos
- Video – YouTube allows people to share and access videos
- Audio – Spotify allows people to listen to music
- Applications – Google Docs is an online word processor
The first website to go live was at CERN in 1991.
By the end of 1991, there were a total of three websites worldwide:
- CERN
- The World Wide Web Virtual Library (also at CERN)
- Stanford Linear Accelerator Center – the first North American website.
By the end of 1992, there were a total of 10 websites worldwide.
As of early 2023, there are almost 2 billion websites.
The World Wide Web
The Internet is the global network of interconnected devices transferring datagrams5 using the Internet Protocol (IP). Most datagrams are transported using the Transmission Control Protocol (TCP).6
The World Wide Web (WWW) is a service that runs on the Internet. It is the collection of documents7 written in Hypertext Markup Language (HTML)8, identified by a Uniform Resource Locator (URL), and transferred using the Hypertext Transfer Protocol (HTTP). All of this is done on the Internet using TCP and IP.
IP identifies devices on the Internet, TCP transports datagrams between them, and HTTP9 is used to access content. TCP is like a delivery company that transports goods to the correct address, while HTTP packages resources so that they can be transported by TCP from one device to another.
HTTP is not the only application layer protocol in use on the Internet, but it is the most visible. Most people don’t notice (or even know about) other application layer protocols such as FTP,10 SSH,11 SMTP,12 and POP3.13
Hypertext Markup Language (HTML)
HTML is used to write webpages and web applications. It is often combined with Cascading Style Sheets (CSS) and JavaScript (JS).14
HTML documents are sent by web servers and rendered by web clients (web browsers) to display the contents of a webpage to a user.
Here’s an example of a simple webpage:
<!DOCTYPE html>
<html>
  <head>
    <title>A Simple Webpage</title>
  </head>
  <body>
    <h1>This is important!</h1>
    <p>Check out this site:</p>
    <a href="//example.com">Example</a>
  </body>
</html>
All HTML documents are composed of HTML tags which (usually) come in pairs: an opening tag, like <html>, and its corresponding closing tag, </html>. These tags mark up the document.
HTML tags perform three different functions:
- Structural Markup tags are used to indicate the structure and purpose of the text in the document. We can see that this page is divided into two parts: a <head> that contains metainformation15 about the page and a <body> that contains the content displayed to the user. In the body, we see there is a heading (<h1>) and a paragraph (<p>).
- Presentational Markup is used to indicate how text should be displayed to the user – for example, bold, italic, strikethrough, etc. There is no presentational markup in this simple webpage. CSS is recommended for presentation.
- Hypertext Markup is used to create links inside the document to other documents or resources. In this page, there is a single hypertext link – the anchor tag <a>.
In general, HTML documents are plain text documents with special annotations (HTML tags) to describe the structure of the document. Web clients (web browsers) display the content to the user.
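To see that an HTML document really is plain text with annotations, here is a minimal Python sketch that reads the simple webpage shown earlier and lists its opening tags and hyperlinks. It uses HTMLParser from Python's standard library. The names PAGE and TagCollector are made up for this illustration, and a real web browser does far more work than this.

```python
from html.parser import HTMLParser

# The simple webpage from this tutorial, stored as a plain string.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>A Simple Webpage</title>
</head>
<body>
<h1>This is important!</h1>
<p>Check out this site:</p>
<a href="//example.com">Example</a>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Records every opening tag and every link target found in a page."""

    def __init__(self):
        super().__init__()
        self.opening_tags = []  # structural view of the page
        self.links = []         # hypertext view of the page

    def handle_starttag(self, tag, attrs):
        self.opening_tags.append(tag)
        if tag == "a":
            # attrs is a list of (name, value) pairs
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = TagCollector()
collector.feed(PAGE)
print(collector.opening_tags)
print(collector.links)
```

Running it prints the tags in document order (html, head, title, body, h1, p, a) followed by the single link target, //example.com.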
Uniform Resource Locator (URL)
A Uniform Resource Locator (URL) is used to access resources on the WWW. It has the following format:
<access protocol>://<host>/<location & resource name>
- access protocol specifies how the resource is to be accessed. For the WWW it is either HTTP or HTTPS16. There are many different access protocols for the Internet.17 Technically, you should always include the access protocol when you type the URL for a website. However, browsers (being helpful) will automatically prefix the protocol for you.
- host specifies which device on the Internet contains the resource. It can be an IP address (like 127.0.0.1) or, more commonly, a human readable string (like www.complete-concrete-concise.com) which is translated by the Domain Name System (DNS) into an IP address.
- location & resource name specifies the name of the resource and where it is located. If no resource name is given, web servers typically return a default document, commonly index.html.
Let’s consider the following URL:
https://complete-concrete-concise.com/sample/helloworld.html
- The access protocol is https.
- The host is complete-concrete-concise.com.
- The location & resource name is /sample/helloworld.html.
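The same breakdown can be done in code. This is a small sketch using urlsplit from Python's standard library. Note that scheme, netloc, and path are Python's names for the three parts, not the terms used in this tutorial.

```python
from urllib.parse import urlsplit

url = "https://complete-concrete-concise.com/sample/helloworld.html"
parts = urlsplit(url)

print(parts.scheme)  # access protocol
print(parts.netloc)  # host
print(parts.path)    # location & resource name
```

This prints https, complete-concrete-concise.com, and /sample/helloworld.html, matching the three parts listed above.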
Hypertext Transfer Protocol (HTTP)
HTTP is a request-response protocol for data transfer on the World Wide Web. It is most commonly used for transferring hypertext documents, like HTML, but can be used to transfer other types of content.
It operates on a client-server model. The client sends a request to a server. The server then responds to the client. For example, a web browser (client) requests a webpage from a website (server); the website (server) responds by sending the webpage to the web browser (client).18
It is a stateless protocol.19 This means that each HTTP request is independent of all other HTTP requests. In other words, the current request knows nothing about previous requests: all information needed to fulfill the request must be contained in the request itself.
Two common HTTP requests are GET, used to request a resource from a server, and POST, used to send data to a server, such as when submitting a comment or form.
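An HTTP/1.1 request is just text. The sketch below builds a simplified GET request to show what "all information must be contained in the request itself" looks like in practice: every request names the host and the resource it wants. The helper build_get_request is a hypothetical name invented for this example, and real browsers send many more headers than this.

```python
def build_get_request(host: str, path: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request (illustrative only)."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, resource, protocol version
        f"Host: {host}",         # which site the request is for
        "Connection: close",     # optional header: close the connection afterwards
        "",                      # blank line marks the end of the headers
        "",
    ]
    # HTTP separates lines with a carriage return followed by a line feed.
    return "\r\n".join(lines)


request = build_get_request("complete-concrete-concise.com", "/sample/helloworld.html")
print(request)
```

Building the same request a second time produces exactly the same text. Nothing in it refers to an earlier request, which is what stateless means.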
HTTP responses indicate the status of a request, with two common responses being 200 (OK) for a successful request and 404 (NOT FOUND) for when the requested resource cannot be found.
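The request-response cycle and the two status codes can be tried out without touching the Internet. The following sketch starts a tiny web server on the local machine and then acts as a client. It uses only Python's standard library (http.server and http.client). The names PAGES, DemoHandler, and fetch_status are assumptions made for this illustration, and the server is a toy, not something to run in production.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# The only resource this toy server knows about.
PAGES = {"/sample/helloworld.html": b"<h1>Hello, world!</h1>"}


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)  # resource cannot be found
            return
        self.send_response(200)   # successful request
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(port, path):
    """Send one GET request and return (status code, reason phrase)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # consume the body
        return response.status, response.reason
    finally:
        conn.close()


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

found = fetch_status(port, "/sample/helloworld.html")
missing = fetch_status(port, "/no-such-page.html")

server.shutdown()
server.server_close()

print(found)
print(missing)
```

The first request prints (200, 'OK') and the second prints (404, 'Not Found').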
Summary
- The World Wide Web is one of many services that run on top of the Internet.
- The World Wide Web is the global collection of resources written using Hypertext Markup Language (HTML),20 identified using a Uniform Resource Locator (URL), and transferred using the Hypertext Transfer Protocol (HTTP).
- Clients request resources from servers using HTTP.
- Servers respond to clients using HTTP.
- HTTP is transported using TCP between devices adhering to the Internet Protocol.
- The flexibility of HTTP has contributed to the World Wide Web becoming extremely popular because it handles all types of content: from documents to videos, from images to applications.
Further Reading
If you are interested in learning more, check out the following resources:
World Wide Web
Uniform Resource Locators
Hypertext Transfer Protocol
- RFC 2616 – this is an older document about the HTTP/1.1 specification. It is a good place to start because it is all in one place.
The following, more recent, documents go into more detail. They update RFC 2616 and break it down into multiple documents, but may be information overload:
- RFC 7230 Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing
- RFC 7231 Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content
- RFC 7232 Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests
- RFC 7233 Hypertext Transfer Protocol (HTTP/1.1): Range Requests
- RFC 7234 Hypertext Transfer Protocol (HTTP/1.1): Caching
- RFC 7235 Hypertext Transfer Protocol (HTTP/1.1): Authentication
1. From Answers For Young People
2. Hypertext is a type of document that contains links (called hyperlinks) to other documents. These links can be clicked to access the other documents.
3. From Information Management: A Proposal
4. Strictly speaking, this is untrue. Berners-Lee developed the Uniform Resource Identifier (URI). There are many kinds of URIs; two common ones are the Uniform Resource Name (URN) and the Uniform Resource Locator (URL). A URN only identifies a resource by name. For example, “the book A Tale of Two Cities by Charles Dickens” is like a URN. A URL, on the other hand, identifies both the resource and its location. For example, “the book A Tale of Two Cities by Charles Dickens can be found at the library on the third bookshelf, second shelf from the bottom, fifth book from the left” is like a URL because it specifies both the resource and its location. Therefore, all URLs are URIs, but not all URIs are URLs (because some URIs are URNs).
5. A datagram is a packet of information. The IP datagram consists of a header and a payload.
6. There are other protocols, but TCP’s robustness and reliability make it very popular.
7. Strictly speaking, this is untrue. The official definition of the World Wide Web is “an information space in which the items of interest, referred to as resources, are identified by global identifiers called Uniform Resource Identifiers (URI)”. The official definition is vague on what a “resource” is: “We do not limit the scope of what might be a resource.” However, practically speaking, most, if not all, resources are HTML webpages or things that act as if they were HTML pages.
8. It does not have to be HTML; it could be XHTML, XML, SVG, or any other HTML-like markup language.
9. Technically, HTTP is an application layer protocol and interface. This concept doesn’t translate well into the real, physical world we inhabit. If you give someone a drink of water, it has to be in an interface that is useful – like a glass. The glass is HTTP, giving the person the water in the glass is TCP, and the receiver is IP. Consider shipping marbles to someone: there is an address (IP), a courier company does the delivery (TCP), but, in order to ship the marbles, you have to adhere to some conventions (protocols) and package the marbles in a box (interface) (HTTP).
10. File Transfer Protocol is used for transferring computer files.
11. Secure Shell provides secure access to a computer over a network.
12. Simple Mail Transfer Protocol is used for transmitting emails.
13. Post Office Protocol is used to retrieve email from a remote server.
14. These will all be covered in greater detail in future tutorials.
15. Metainformation is information about the document, not the information contained in the document. For example, metainformation about a document could include its length, language, date of publication, date of last revision, author, etc.
16. HTTPS is the secure form of HTTP. Using HTTPS ensures all communication between the client and server is encrypted. Surprisingly, HTTP and HTTPS access can both occur on the same page. This occurs because HTML documents often contain hyperlinks to other documents. Those hyperlinks point to documents that are transmitted using either HTTP or HTTPS.
17. A list of official and unofficial protocols can be found here
18. Actually, the server can respond in a number of different ways. For example, if the webpage is not found, it can respond with a 404 error.
19. This is true of the HTTP protocol itself. Servers tend to log all or part of incoming HTTP requests. These logs are for maintenance and forensics, not for keeping track of previous HTTP requests. Clients also tend to cache components of the HTTP response – for example: cookies, images, HTML pages, and other resources. These topics will be addressed in future tutorials.
20. This is mostly true. The documents don’t have to be written in HTML, but they do have to be written in something that, more or less, behaves as if it were HTML, for example, XHTML.