We will begin this chapter by taking a closer look at the technologies that make the Internet, and specifically the World Wide Web (WWW), work. This discussion will be a useful refresher for many and will provide a solid foundation for showing how our web applications work. It will also frame our later discussions of performance, scalability, and security.
The Internet: It's Less Complicated Than You Think
Many beginning programmers will confess that the workings of the World Wide Web seem only slightly less fantastical than black magic. However, this could not be further from the truth. One of the most powerful features of the Internet, and surely one of the greatest factors leading to its rapid acceptance and widespread adoption, is that it is based almost entirely on simple, open specifications and technologies that are freely available to all.
Most of the work on the Internet is done by a few key protocols or mechanisms by which computers talk to each other. For example, TCP/IP is the base protocol by which computers communicate. Other protocols, which provide progressively more and more functionality, are built on top of this. As we shall see, the Hypertext Transfer Protocol (HTTP) is simply a series of text messages (albeit with some well-defined structure to them) sent from one computer to another via TCP/IP.
While this flexibility and openness facilitates adoption and allows for easy customization and extensions, it does have its drawbacks, most notably in the realm of security. Because of its well-known structure and the availability of free implementations, there are many opportunities for people with less than noble intentions to exploit this openness. We will cover this in greater detail in Chapter 16, "Securing Your Web Applications: Planning and Code Security."
One important factor that should be considered when writing a web application is the speed at which users can connect to the Internet. Although there has been a surge in the availability of high-bandwidth connections over media such as DSL, television cable, and satellite, a large portion of users are still using standard modems no faster than 56kbps.
If you are designing an application that you know will only be used by corporate customers with high-speed connections, it might not be a problem to include large, high-resolution images or video in your site. However, home users might be quickly turned off, opting to go somewhere less painfully slow.
Computers Talking to Computers
The key technology that makes the Internet work as we know it today is the TCP/IP protocol, which is actually a pair of protocols. The Internet Protocol (IP) is a mechanism by which computers identify and talk to each other. Each computer has what is called an IP address, a set of numbers (not entirely unlike a telephone number) that identifies the computer on the Internet. The Internet Protocol allows two computers with IP addresses to send each other messages.
The format of these IP addresses depends on the version of IP in use, but the one most commonly used today is IPv4, where IP addresses consist of four one-byte numbers ranging from 0 to 254 (255 is reserved for broadcasting to large numbers of computers) and are typically written as xxx.yyy.zzz.www (such as 192.168.100.1). There are various ways of grouping the possible IPv4 addresses so that when a particular machine wants to send a message to another, it does not need to know the destination's exact location, but can instead send the message to intermediaries that know how to ensure the message ends up at the correct computer (see Figure 13-1).
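The dotted-quad layout described above is easy to take apart programmatically. As a quick language-neutral sketch (this chapter's examples are otherwise PHP; the function name here is our own), here is how the four one-byte numbers of an IPv4 address might be validated:

```python
def parse_ipv4(address):
    """Split a dotted-quad IPv4 string into its four one-byte numbers.

    Raises ValueError if the string is not a well-formed IPv4 address.
    """
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError("an IPv4 address has exactly four parts")
    octets = []
    for part in parts:
        if not part.isdigit():
            raise ValueError("each part must be a decimal number")
        value = int(part)
        if not 0 <= value <= 255:
            raise ValueError("each part must fit in one byte (0-255)")
        octets.append(value)
    return octets

print(parse_ipv4("192.168.100.1"))  # [192, 168, 100, 1]
```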
Figure 13-1. Two computers talking over the Internet via TCP/IP.
However, one problem with IPv4 is that the number of unallocated IP addresses is running low, and there is often an uneven distribution of addresses. (For example, there are a few universities in the USA with more IP addresses than all of China!) One way of conserving IP addresses is for organizations to use a couple of reserved address ranges for internal use and only have a few computers directly exposed to the Internet. These reserved ranges (192.168.x.y and 10.x.y.z) can be used by anybody and are designed to be used for internal networks. They usually have their traffic routed to the Internet by computers running network address translators (NAT), which allow these "nonpublic" addresses to take full advantage of the features available.
A new version of IP, IPv6, has been developed and is seeing increasing adoption. While it would not hurt to learn about this new version and its addressing scheme (which has a significantly larger address space than IPv4), we mostly use IPv4 in our examples (although we do not do anything to preclude the use of IPv6). Many key pieces of software, including Apache HTTP Server and PHP, are including IPv6 support in newer releases for early adopters.
The Internet Protocol does little more than allow computers to send messages to each other. It does nothing to verify that messages arrive in the order they were sent, or that they arrive without corruption. (Only the key header data is verified.)
To provide this functionality, the Transmission Control Protocol (TCP) was designed to sit directly on top of IP. TCP makes sure that packets actually arrive in the correct order and makes an attempt to verify that the contents of each packet are unchanged. This implies some extra overhead and less efficiency than raw IP, but the alternative, every single program doing this work itself, would be a truly unpleasant prospect.
On top of the IP address, TCP introduces a key concept called a port, which permits computers to expose a variety of services or offerings over the network. Various port numbers are reserved for different services, and these numbers are both published and well known. On the machine exposing the services, programs called services, or daemons, listen for traffic on a particular port. For example, most e-mail is sent over port 25, while HTTP traffic for the WWW (with which we are dealing extensively in this book) occurs on port 80. You will occasionally see a reference to a web site (URL) written as http://www.mywebsitehooray.com:8080, where the :8080 tells your web browser what port number to use (in this case, port 8080).
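The :8080 suffix is part of the URL's standard syntax, and most HTTP libraries will pick it out for you. A brief illustrative sketch (in Python rather than PHP, purely for brevity):

```python
from urllib.parse import urlsplit

# Split a URL that carries an explicit port number.
url = urlsplit("http://www.mywebsitehooray.com:8080/index.php")

print(url.hostname)  # www.mywebsitehooray.com
print(url.port)      # 8080

# When no port is given, the scheme implies the default (80 for http);
# urlsplit reports None and callers fall back to the well-known port.
default = urlsplit("http://www.mywebsitehooray.com/")
print(default.port)  # None
```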
The one other key piece of the Internet puzzle is the means by which names, such as www.warmhappybunnies.com, are mapped to an IP address. This is done by a system called the Domain Name System, or DNS. This is a hierarchical system of naming that maps names onto IP addresses and provides a more easily read and remembered way of referring to a server.
The system works by having a number of "top level" domains (com, org, net, edu, ca, de, cn, jp, biz, info, and so on) that then have their own domains within them. When you enter a name, such as www.warmhappybunnies.com, the software on your computer knows to connect to a DNS server (also known as a "name server") which, in turn, knows to go to a "root name server" for the com domain and get more information for the warmhappybunnies domain. Eventually, your computer gets an IP address back, which the TCP/IP software in the operating system knows how to use to talk to the desired server.
The Hypertext Transfer Protocol
The web servers that make the WWW work typically "listen" on port 80, the port reserved for HTTP traffic. These servers operate in a very simple manner. Somebody, typically called the "client," connects to the port and makes a request for information from the "server." The request is analyzed and processed. Then a response is sent, with content or with an error message if the request was invalid or could not be processed. After all of this, the connection is closed and the server goes back to listening for somebody else to connect. The server does not care who is connecting and asking for data and, apart from some simple logging, does not remember anything about the connection. This is why HTTP is sometimes called a stateless protocol: no information is shared between connections to the server.
The format of both the HTTP request and response is very simple. Both share the following plain text format:
Initial request/response line
Optional Header: value
[Other optional headers and values]
[blank line, consisting of CR/LF]
Optional Body that comes after the single blank line.
An HTTP request might look something like the following:
GET /index.php HTTP/1.1
Host: www.myhostnamehooray.com
User-Agent: WoobaBrowser/3.4 (Windows)
[this is a blank line]
The response to the previous request might be something as follows:
HTTP/1.1 200 OK
Date: Wed, 8 Aug 2001 18:08:08 GMT
Content-Type: text/html
Content-Length: 1234

<html>
<head>
<title>Welcome to my happy web site!</title>
</head>
<body>
<p>Welcome to our web site !!! </p>
...
..
.
</body>
</html>
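Because both requests and responses are just structured text, they are easy to pick apart programmatically. The following sketch (in Python for illustration; the function name is our own) parses the status line and headers of a response like the one above:

```python
def parse_response_head(raw):
    """Parse the status line and headers of a plain-text HTTP response.

    `raw` is everything up to (but not including) the blank line that
    separates the headers from the body.
    """
    lines = raw.split("\r\n")
    # Status line: protocol version, numeric status code, reason phrase.
    version, status, reason = lines[0].split(" ", 2)
    # Each remaining line is "Name: value".
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(status), reason, headers

head = ("HTTP/1.1 200 OK\r\n"
        "Content-Type: text/html\r\n"
        "Content-Length: 1234")
version, status, reason, headers = parse_response_head(head)
print(status, headers["Content-Type"])  # 200 text/html
```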
There are a couple of other HTTP methods similar to the GET method, most notably POST and HEAD. The HEAD method is similar to the GET method with the exception that the server only sends the headers rather than the actual content.
As we saw in Chapter 7, "Interacting with the Server: Forms," the POST method is similar to the GET method but differs in that it is typically used for sending data to the server for processing. Thus, it also contains additional headers with information about the format in which the data is presented and the size of this data, along with a message body containing it. It is typically used to send the results from forms to the web server.
POST /createaccount.php HTTP/1.1
Host: www.myhostnamehooray.com
User-Agent: WoobaBrowser/3.4 (Windows)
Content-Type: application/x-www-form-urlencoded
Content-Length: 64

username=Marc&address=123+Somewhere+Lane&city=Somewhere&state=WA
The Content-Type of this request tells us how the parameters are put together in the message body. The application/x-www-form-urlencoded type means that the parameters are encoded in the same format used to add parameters to a GET request, as we saw in Chapter 7.
Thus, the POST request can be sent as a GET:
GET /createaccount.php?username=Marc&address=123+Somewhere+Lane&city=Somewhere&state=WA HTTP/1.1
Host: www.myhostnamehooray.com
User-Agent: WoobaBrowser/3.4 (Windows)
[this is a blank line]
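Either way, the urlencoded parameter string does not need to be built by hand; most languages ship a helper for it. A sketch using Python's standard library (the field names mirror the example above):

```python
from urllib.parse import urlencode, parse_qs

fields = {
    "username": "Marc",
    "address": "123 Somewhere Lane",
    "city": "Somewhere",
    "state": "WA",
}

# urlencode escapes spaces as '+' by default, matching the
# application/x-www-form-urlencoded style seen above.
body = urlencode(fields)
print(body)
# username=Marc&address=123+Somewhere+Lane&city=Somewhere&state=WA

# The companion helper decodes a query string back into fields
# (each name maps to a list, since names may repeat).
decoded = parse_qs(body)
print(decoded["username"])  # ['Marc']
```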
Many people assume that since they cannot easily see the values of the form data in POST requests that are sent to the server, they must be safe from prying eyes. However, as we have just seen, the only difference is where they are placed in the plain-text HTTP request. Anybody with a packet-sniffer who is looking at the traffic between the client and the server is just as able to see the POST data as a GET URI. Thus, it would certainly be risky to send passwords, credit card numbers, and other information unchanged in a regular HTTP request or response.
The application/x-www-form-urlencoded Content-Type shown in the previous section is an example of what are called Multipurpose Internet Mail Extensions (MIME). This is a specification that came about from the need to have Internet mail protocols support more than plain ASCII (US English) text. As it was recognized that these types would be useful beyond simple e-mail, the number of types has grown, as has the number of places in which they are used, including our HTTP headers.
MIME types are divided into two parts, the media type and its subtype, separated by a forward slash character:
Common media types you will see are
Subtypes vary greatly for various media types, and you will frequently see some of the following combinations:
Some MIME types include additional attributes after the type to specify things, such as the character set they are using or the method via which they were encoded:
We will not need MIME types often throughout this book, but when we do, the values will mostly be the application/x-www-form-urlencoded, text/html, and image/jpeg types mentioned previously.
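Pulling a MIME type and its optional attributes apart is straightforward in any language. A small sketch (the function name is our own):

```python
def parse_mime(value):
    """Split 'media/subtype; attr=value' into its pieces."""
    # Separate the type itself from any trailing attributes.
    type_part, _, param_part = value.partition(";")
    media, _, subtype = type_part.strip().partition("/")
    # Attributes, when present, are 'name=value' pairs.
    params = {}
    if param_part:
        for item in param_part.split(";"):
            name, _, val = item.partition("=")
            params[name.strip()] = val.strip()
    return media, subtype, params

print(parse_mime("text/html; charset=UTF-8"))
# ('text', 'html', {'charset': 'UTF-8'})
```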
Secure Sockets Layer (SSL)
As people came to understand the risks associated with transmitting plain text over the Internet, they started to look at ways to encrypt this data. The solution most widely used today across the WWW is Secure Sockets Layer (SSL) encryption. SSL is largely a transport-level protocol: it is strictly a way to encode the TCP/IP traffic between two computers and does not alter the plain-text HTTP messages that are sent over the secured connection.
What SSL Is
SSL is a protocol for secure network communications between computers on the Internet. It provides a way to encrypt the TCP/IP traffic on a particular port between two computers. This makes the traffic more difficult to view by people watching the network traffic between the two machines.
SSL is based on public key cryptography, a mechanism by which information is encrypted and decrypted using pairs of keys (one of which is a public key) instead of single keys. It is because of the existence of these public keys that we can use SSL and public key cryptography to secure traffic to and from web servers over HTTP; this variant is commonly called HTTPS.
What SSL Is Not
Before we get to the details of how SSL works, it is important to realize that SSL is not a panacea for security problems. While it can be used to make your web applications more secure, designers and developers alike should avoid falling into the trap of thinking that using SSL solves all security concerns.
It neither relieves us of the need to filter all of our input from external sources, nor prevents us from having to pay attention to our servers, software bugs, or other means through which people might attack our applications. Malicious users can still connect to our web application and enter mischievous data to try to compromise our security. SSL merely provides a way to prevent sensitive information from being transmitted over the Internet in an easily readable format.
A reasonably straightforward form of encryption that many users will be familiar with is symmetric encryption. In this, two parties share the same private key (like a password). One party encrypts a piece of information with this private key and sends the data to the other party, which in turn decrypts it with the same private key.
This encryption tends to be fast, secure, and reliable. Algorithms for symmetric encryption of chunks of data include DES, 3DES, RC5, and AES. While some of these algorithms have proved increasingly weak as growing computing power has made brute-force attacks on their keys practical, others have continued to hold up well.
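To make the symmetric idea concrete, here is a deliberately toy sketch. XOR with a repeating key is NOT a safe cipher (use a vetted algorithm such as AES in practice), but it demonstrates the defining property that one shared key both encrypts and decrypts:

```python
from itertools import cycle

def xor_with_key(data: bytes, key: bytes) -> bytes:
    """Toy 'symmetric cipher': XOR each byte with a repeating key.

    Applying it twice with the same key returns the original data,
    which is the defining property of symmetric encryption.
    """
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

secret_key = b"shared-secret"
message = b"meet me at the usual place"

ciphertext = xor_with_key(message, secret_key)   # "encrypt"
recovered = xor_with_key(ciphertext, secret_key)  # "decrypt"
print(recovered == message)  # True
```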
The big problem with symmetric encryption is the shared private key. Unless your computer knows the private key that the other computer is using, you cannot encrypt and decrypt traffic. The other computer cannot publish what the key is either, because anybody could use it to view the traffic between the two computers.
A solution to this problem exists in the form of an encryption innovation known as public key cryptography, or asymmetric encryption. The algorithms for this type of encryption use two keys: the public key and the private key. The private key is kept secret on one party (typically a server of some sort). The public key, on the other hand, is shared with anyone who wants it.
The magic of the algorithm is that once you encrypt data with one of the keys, only the holder of the other key can in turn decrypt that data. Therefore, there is little security risk in people knowing a server's public key; they can use it only to encrypt data that the server can decrypt with its private key. People viewing the traffic between two servers cannot analyze it without that private key.
One problem with public key cryptography is that it tends to be slower than symmetric encryption. Thus, most protocols that use it (such as SSL) make the connection initially with public key cryptography and then exchange a symmetric private key between the two computers so that subsequent communications can be done via the symmetric encryption methods. To prevent tampering with the encrypted data, an algorithm is run on the data before it is encrypted to generate a message authentication code (MAC), which is then sent along with the encrypted data. Upon decryption at the other end, the same algorithm is run on the unpacked data and the results are compared to make sure nobody has tampered with or corrupted the data (see Figure 13-2).
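The integrity check just described can be sketched with Python's standard hmac module. This shows the general MAC idea (compute, transmit, recompute, compare), not SSL's exact construction; the key and payload values here are made up:

```python
import hashlib
import hmac

shared_key = b"negotiated-session-key"
payload = b"GET /index.php HTTP/1.1"

# Sender: compute a MAC over the data before sending it along.
mac = hmac.new(shared_key, payload, hashlib.sha256).digest()

# Receiver: recompute the MAC over the received data and compare
# (compare_digest avoids timing side channels).
expected = hmac.new(shared_key, payload, hashlib.sha256).digest()
print(hmac.compare_digest(mac, expected))  # True

# A tampered payload no longer produces a matching MAC.
tampered = hmac.new(shared_key, b"GET /evil.php HTTP/1.1",
                    hashlib.sha256).digest()
print(hmac.compare_digest(mac, tampered))  # False
```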
Figure 13-2. SSL in a nutshell.
How Web Servers Use SSL
As we can see, public key cryptography and SSL are perfect for our web server environment, where we would like random computers to connect to our server but still communicate with us over an encrypted connection to protect the data being sent between client and application. However, we still have one problem to solve: how can we be sure that we are connecting to the real server and not to somebody who has made his computer appear to be that server? Given that there are ways to do this, the fake server could give us its own public key that would let it decrypt and view all of the data being sent. In effect, we need to authenticate servers by creating a trusted link between a server and its public key.
This is done via the mechanism of digital certificates. These are files with information about the server, including the domain name, the company that requested the certificate, where it conducts business, and its public key. This digital certificate is in turn encrypted by a signing authority using its own private key.
Your client web browser stores the public keys for a number of popular signing authorities and implicitly trusts these. When it receives a certificate from a web site, it uses the public key from the authority that generated the certificate to decrypt the encoded signature that the signing authority added to the end with its private key. It also verifies that the domain name encoded in the certificate matches the name of the server to which you have just connected. This lets it verify that the certificate and server are valid.
Thus, we can be sure that the server is the one the certificate says it is. If somebody merely copied the certificate to a false server, he would never be able to decrypt traffic from us because he would not have the correct private key. Nor could he create a "fake" certificate, because it would not be signed by a signing authority, and our computer would complain and encourage us not to trust it.
The sequence of events for a client connecting to a web server via SSL is as follows:
While this is a simplification of the process (these protocols are remarkably robust and have many additional details to prevent other attacks and problems), it is entirely adequate for our needs in this book.
We will see more about how we use SSL in PHP and our web servers in Chapter 17, "Securing Your Web Applications: Software and Hardware Security."
Other Important Protocols
There are a few other protocols in wide use today that you might encounter while writing various web applications.
Simple Mail Transfer Protocol (SMTP)
The protocol by which the vast majority of e-mail (and spam) is sent over the Internet today is the Simple Mail Transfer Protocol (SMTP). It allows computers to send each other e-mail messages over TCP/IP and is sufficiently flexible to allow all sorts of rich message types and languages to pass over the Internet. If you plan to have your PHP web applications send e-mail, this is the protocol they will use.
Simple Object Access Protocol (SOAP)
A reasonably new protocol, the Simple Object Access Protocol (SOAP), is an XML-based protocol that allows applications to exchange information over the Internet. It is commonly used to allow applications to access XML Web Services over HTTP (or HTTP and SSL). This mechanism is based on XML, a simple and powerful markup language (see Chapter 23, "XML and XHTML") that is platform independent, meaning that anybody can write a client for a known service.
We will learn more about SOAP in Chapter 27, "XML Web Services and SOAP."