5.6. Defense Approaches


Given the basic dichotomy between prevention and reaction, the goals of DDoS defense, and the three types of locations where defenses can be located, we will now discuss the basic options for defending against DDoS attacks that have been investigated to date. The discussion here is at a high level, with only a few examples of actual systems that have been built or actions that you can take, since the purpose of this material is to lay out for you the entire range of options. Doing so will then make it easier for you to understand and evaluate the more detailed defense information presented in subsequent chapters.

Some DDoS defenses concentrate on protecting you against DDoS. They try to ensure that your network and system never suffer the DDoS effect. Other defenses concentrate on detecting attacks when they occur and responding to them to reduce the DDoS effect on your site. We will discuss each in turn.

Most of these approaches are not mutually exclusive, and one can build a more effective overall defense by combining several of them. Using a layered approach that combines several types of defenses, at several different locations, can be more flexible and harder for an attacker to completely bypass. This layering includes host-level tuning and adequate resources, close-proximity network-level defenses, as well as border- or perimeter-level network defenses.

5.6.1. Protection

Some protection approaches focus on eliminating the possibility of the attack. These attack prevention approaches introduce changes into Internet protocols, applications and hosts, to strengthen them against DDoS attempts. They patch existing vulnerabilities, correct bad protocol design, manage resource usage, and reduce the incidence of intrusions and exploits. Some approaches also advocate limiting computer versatility and disallowing certain functions within the network stack (see, for example, [BR00, BR01]). These approaches aspire to make the machines that deploy them impervious to DDoS attempts. Attack prevention completely eliminates some vulnerability attacks, impedes the attacker's attempts to gain a large agent army, and generally pushes the bar for the attacker higher, making her work harder to achieve a DoS effect. However, while necessary for improving Internet security, prevention does not eliminate the DDoS threat.

Other protection approaches focus on enduring the attack without creating the DoS effect. These endurance approaches increase and redistribute a victim's resources, enabling it to serve both legitimate and malicious requests during the attack, thus canceling the DoS effect. The increase is achieved either statically, by purchasing more resources, or dynamically, by acquiring resources at the sign of a possible attack from a set of distributed public servers and replicating the target service. Endurance approaches can significantly enhance a target's resistance to DDoS: the attacker must now work exceptionally hard to deny the service. However, the effectiveness of endurance approaches is limited to cases in which the increased resources exceed the attack volume. Since an attacker can potentially gather hundreds of thousands of agent machines, endurance is not likely to offer a complete solution to the DDoS problem, particularly for individuals and small businesses that cannot afford to purchase the quantities of network resources required to withstand a large attack.

Hygiene

Hygiene approaches try to close as many opportunities for DDoS attacks in your computers and networks as possible, on the generally sound theory that the best way to enhance security is to keep your network simple, well organized, and well maintained.

Fixing Host Vulnerabilities Vulnerability DDoS attacks target a software bug or an error in protocol or application design to deny service. Thus, the first step in maintaining network hygiene is keeping software packages patched and up to date. In addition, applications can also be run in a contained environment (for instance, see Provos' Systrace [Pro03]), and closely observed to detect anomalous behavior or excess resource consumption.

Even when all software patches are applied as soon as they are available, it is impossible to guarantee the absence of bugs in software. To protect critical applications from denial of service, they can be duplicated on several servers, each running a different operating system and/or application version, an approach akin to biodiversity. This, however, greatly increases administrative requirements.

As described in Chapters 2 and 4, another major vulnerability that requires attention is more social than technical: weak or no passwords for remotely accessible services, such as Windows remote access for file services. Even a fully patched host behind a good firewall can be compromised if arbitrary IP addresses are allowed to connect to a system with a weak password on such a service. Malware, such as Phatbot, automates the identification and compromise of hosts that are vulnerable due to such password problems. Any good book on computer security or network administration should give you guidance on checking for and improving the quality of passwords on your system.

Fixing Network Organization Well-organized networks have no bottlenecks or hot spots that can become an easy target for a DDoS attack. A good way to organize a network is to spread critical applications across several servers, located in different subnetworks. The attacker then has to overwhelm all the servers to achieve denial of service. Providing path redundancy among network points creates a robust topology that cannot be easily disconnected. Network organization should be as simple as possible to facilitate easy understanding and management. (Note, however, that path redundancy and simplicity are not necessarily compatible goals, since multiple paths are inherently more complex than single paths. One must make a trade-off on these issues.)

A good network organization not only repels many attack attempts, it also increases robustness and minimizes the damage when attacks do occur. Since critical services are replicated throughout the network, machines affected by the attack can be quarantined and replaced by the healthy ones without service loss.

Filtering Dangerous Packets Most vulnerability attacks send specifically crafted packets to exploit a vulnerability on the target. Defenses against such attacks at least require inspection of packet headers, and often even deeper into the data portion of packets, in order to recognize the malicious traffic. However, such data inspection cannot be done with most firewalls and routers, and filtering requires the use of an inline device. Even when these devices can recognize the relevant packet features, there are often reasons against using them this way. For example, making many rapid changes to firewall rules and router ACLs is often frowned upon for stability reasons (e.g., what if an accident leaves your firewall wide open?). Some types of Intrusion Prevention Systems (IPS), which act like an IDS by recognizing packets by signature and then filtering or altering them in transit, could be used, but may be problematic and/or costly on very high bandwidth networks.

Source Validation

Source validation approaches verify the user's identity prior to granting his service request. In some cases, these approaches are intended merely to combat IP spoofing. While the attacker can still exhaust the server's resources by deploying a huge number of agents, this form of source validation prevents him from using IP spoofing, thus simplifying DDoS defense.

More ambitious source validation approaches seek to ensure that a human user (rather than DDoS agent software) is at the other end of a network connection, typically by performing so-called Reverse Turing tests.[6] The most commonly used type of Reverse Turing test displays a slightly blurred or distorted picture and asks the user to type in the depicted symbols (see [vABHL03] for more details). This task is trivial for humans, yet very hard for computers. These approaches work well for Web-based queries, but could be hard to deploy for nongraphical terminals. Besides, imagine that you had to decipher some picture every time you needed to access an online service. Wouldn't that be annoying? Further, this approach cannot work when the communications in question are not supposed to be handled directly by a human. If your server responds directly to any kind of request that is not typically generated by a person, Reverse Turing tests do not solve your problem. Pings, e-mail transfers between mail servers, time synchronization protocols, routing protocol updates, and DNS lookups are a few examples of computer-to-computer interactions that could not be protected by Reverse Turing tests.

[6] The original Turing test was passed when an artificial intelligence program could fool people into thinking it was human. The Reverse Turing test can only be passed by a human.

Finally, some approaches verify the user's legitimacy. In basic systems, this verification can be no more than checking the user's IP address against a list of legitimate addresses. To achieve higher assurance, some systems require that the user present a certificate, issued by some well-known authority, that grants him access to the service, preferably for a limited time only. Since certificate verification is a cryptographic activity, it consumes a fair amount of the server's resources and opens the possibility for another type of DDoS attack. In this attack, the attacker generates many bogus certificates and forces the server to spend resources verifying them.

Note that any agent machine that is capable of proving its legitimacy to the target will pass these tests. If nothing more is done by the target machine, once the test is passed an agent machine can perpetrate the DDoS attack at will. So an attacker who can recruit sufficient legitimate clients of his target as agents can defeat such systems. If you run an Internet business selling to the general public, you may have a huge number of clients who are able to prove their legitimacy, making the attacker's recruitment problem not very challenging.

This difficulty can perhaps be addressed by requiring a bit more from machines that want to communicate with your site, using a technique called proof of work.

Proof of Work

Some protocols are asymmetric: they consume more resources on the server side than on the client side. Those protocols can be misused for denial of service. The attacker generates many service requests and ties up the server's resources. If the protocol is such that the resources are released after a certain time, the attacker simply repeats the attack to keep the server's resources constantly occupied.

One approach to protect against attacks on such asymmetric protocols is to redesign the protocols to delay commitment of the server's resources. The protocol is balanced by introducing another asymmetric step, this time in the server's favor, before committing the server's resources. The server requires a proof of work from the client.

The asymmetric step should ensure that the client has spent sufficient resources for the communication before the server spends its own resources. A commonly used approach is to send a client some puzzle to solve (e.g., [JB99, CER96]). The puzzle is such that solving it takes a fair amount of time and resources, while verifying the correctness of the answer is fast and cheap. Such puzzles are called one-way functions or trapdoor functions [MvOV96]. For example, a server could easily generate a large number and ask the client to factor it. Factoring of large numbers is a hard problem and it takes a lot of time. Once the client provides the answer, it is easy to multiply all the factors and see that they produce the number from the puzzle. After verifying the answer, the server can send another puzzle or grant the service request. Of course, the client machine runs software that automatically performs the work requested of it, so the human user is never explicitly aware of the need to solve the puzzle.
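
To make the puzzle idea concrete, here is a minimal Python sketch that uses a hash-reversal puzzle rather than the factoring example above; it illustrates the general technique, not any particular published system, and the difficulty value, message format, and function names are our own choices. Generating and verifying a puzzle each cost one hash computation, while solving it costs roughly 2^20 hash computations on average.

    import hashlib
    import os
    import time

    DIFFICULTY = 20  # leading zero bits required in the solution hash

    def make_puzzle(client_ip):
        # Server side: generating a puzzle is cheap (one random value). Tying it
        # to the client's address and the current time hinders replay and theft
        # of answers by other clients.
        return "%s|%d|%s" % (client_ip, int(time.time()), os.urandom(8).hex())

    def solve_puzzle(puzzle):
        # Client side: brute-force a counter until the hash of puzzle+counter
        # begins with DIFFICULTY zero bits (roughly 2**DIFFICULTY attempts).
        counter = 0
        while True:
            digest = hashlib.sha256(("%s|%d" % (puzzle, counter)).encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
                return counter
            counter += 1

    def verify_solution(puzzle, counter):
        # Server side: verification costs a single hash computation.
        digest = hashlib.sha256(("%s|%d" % (puzzle, counter)).encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0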

The use of proof-of-work techniques ensures that the client has to spend a lot more resources than the server before his request is granted. The amount of work required must be small enough that legitimate clients do not mind it, or usually even notice it, but large enough to slow down DDoS agents very heavily, making it difficult or perhaps impossible for them to send enough messages to the target to cause a DDoS effect.

At best, proof-of-work techniques make attacks using spoofed source addresses against handshake protocols less effective from small- to moderate-sized attack networks. (The exact efficiency of these techniques is not clear.) DDoS attacks are still feasible if the attacker uses much larger attack networks. Beyond simple flooding, there are two possible ways to use spoofed packets to perform an attack against a proof-of-work system. One way is for the agents to generate a lot of requests, then let the server send out puzzles to many fake addresses, thus exhausting its resources. Since puzzle generation consumes very few resources, the attacker would have to amass many agents to make this attack effective. The other way is for agents to generate a lot of false solutions to puzzles with spoofed source addresses (with or without previously sending in spoofed requests). Since the server spends some resources to verify the reply, this could be a way to tie up the server's resources and deny service. However, puzzle verification is also cheap for the server, and the attacker needs a huge number of agents to make this attack effective. (Keep in mind that some of today's attackers do, indeed, already have a huge number of agents.) The only "economical" way to deny service is for agents to act like legitimate clients, sending valid service requests and providing correct solutions for puzzles, to lead the server to commit his resources. Spoofing cannot be used in this attack, since the agent machine must receive the puzzles from the target to solve them. If the requests are spoofed, the puzzle will be delivered to another machine and the agent will not be able to provide the desired answer.

Elimination of IP spoofing facilitates use of other DDoS defenses that may help in the latter case. Thus, proof-of-work techniques would best be combined with other defensive techniques.

There are several requirements to make the proof-of-work approach practical and effective. First, it would be good if the approach were transparent to the clients and deployed only by the server. Popular services have no way to ensure that their vast client population simultaneously upgrades the software. For these services, a proof-of-work solution will be practical only if it can be unilaterally deployed. For instance, imagine a protocol that goes as follows:

  1. Client sends a request to the server.

  2. Server allocates some resources and sends a reply back to the client.

  3. Client allocates some resources and sends a reply back to the server.

  4. Server grants the request.

This protocol can be balanced unilaterally by modifying steps 2 and 4. In step 2, the server does not allocate any resources. Instead, he embeds some information from the request in the reply he sends to the client. When the client replies, the server recreates the original request information in step 4 and allocates resources. The proof of work on the client side consists not in solving some puzzle, but in allocating resources, just like the original protocol prescribes. For this solution to work, the client must repeat the embedded information in his reply, so that the server can use it in step 4.
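
A minimal sketch of this unilateral balancing follows, assuming the request fits in a single message; the use of an HMAC as the embedded information and all function names are illustrative choices of ours, not part of any specific protocol.

    import hashlib
    import hmac
    import os

    SERVER_KEY = os.urandom(16)  # secret known only to the server

    def step2_reply(request):
        # Server: allocate nothing. Echo the request (a bytes object) together
        # with a keyed checksum ("cookie") so the server can later recognize
        # its own reply.
        cookie = hmac.new(SERVER_KEY, request, hashlib.sha256).hexdigest()
        return request, cookie

    def step4_grant(echoed_request, cookie):
        # Server: recompute the checksum over the echoed request. Only a client
        # that actually received the step-2 reply can present a matching cookie,
        # and only now does the server commit any resources.
        expected = hmac.new(SERVER_KEY, echoed_request, hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, cookie):
            return {"request": echoed_request, "state": "allocated"}
        return None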

Consider the TCP protocol as an example. TCP performs a three-way handshake at connection establishment. In its original form, this was an asymmetric protocol that required the server to commit resources early in the protocol. The server allocates resources (transmission control blocks from a fixed length table) upon receipt of a connection request (SYN packet). If the client never completes the connection, the server's resources remain allocated for a fairly long time. TCP SYN attacks, described in Chapter 4, allowed attackers to use this characteristic to perform a DoS attack with a relatively low volume of requests.

The TCP SYN cookie approach [Ber] modifies this protocol behavior to require the client to commit his resources first. The server encodes the information that would normally be stored in the transmission control block in the server's initial sequence number value. The server then sends this value in the connection reply packet (SYN-ACK) to the client and does not preserve any state. If the client completes the connection (and allocates its own transmission control block locally), the server retrieves the encoded information from the client's connection-completion packet and only then allocates a transmission control block. If the client never completes the connection, the server never allocates resources for this connection.
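
The following simplified sketch shows the idea of packing the connection state into the 32-bit initial sequence number; real implementations differ in details such as the hash function, the counter handling, and tolerance for counter rollover, so treat this only as an illustration of the encoding.

    import hashlib
    import time

    SECRET = b"server-secret"              # illustrative; a real server uses a random key
    MSS_TABLE = [536, 1460, 4380, 8960]    # a few representative MSS values

    def syn_cookie(src, sport, dst, dport, mss_index):
        # Encode connection parameters into the initial sequence number:
        #   top 5 bits:  slowly incrementing counter (about 64-second granularity)
        #   next 3 bits: index into MSS_TABLE
        #   low 24 bits: keyed hash over the connection 4-tuple and the counter
        t = (int(time.time()) >> 6) & 0x1F
        data = ("%s%d%s%d%d" % (src, sport, dst, dport, t)).encode() + SECRET
        h = int.from_bytes(hashlib.sha256(data).digest()[:3], "big")
        return (t << 27) | (mss_index << 24) | h

    def check_cookie(src, sport, dst, dport, ack_seq):
        # On the final ACK, recompute the hash from the acknowledged sequence
        # number; if it matches, recover the MSS index and only then allocate
        # a transmission control block.
        cookie = (ack_seq - 1) & 0xFFFFFFFF    # the client acknowledges cookie + 1
        t = cookie >> 27
        data = ("%s%d%s%d%d" % (src, sport, dst, dport, t)).encode() + SECRET
        h = int.from_bytes(hashlib.sha256(data).digest()[:3], "big")
        if (cookie & 0xFFFFFF) != h:
            return None                        # bogus or stale cookie: commit nothing
        return MSS_TABLE[(cookie >> 24) & 0x7]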

The second requirement for proof-of-work solutions is that the required work has to be equally hard for all clients, regardless of their hardware. Otherwise, an attacker who has compromised a powerful machine might be able to solve puzzles very quickly, thus generating enough requests to overwhelm the server despite solving all the puzzles. This requirement is hard to meet in the case of protocols that send out puzzles, because puzzle solving is computationally intensive and much easier for faster processors. Unless the amount of work is reasonable for even the least powerful legitimate client, a proof-of-work solution causes performance degradations even when no attack is ongoing. Some recent research [LC04] suggests that proofs hard enough to cause problems for attackers are so hard that many legitimate clients are hurt.

The third requirement states that theft or replay of answers must be prevented. In other words, a client himself must do the work. He cannot save and reuse old answers, and he cannot steal somebody else's answer. Puzzle-generation techniques usually meet these requirements by generating time-dependent puzzles, and making them depend on the client identity.

Ultimately, proof-of-work systems cannot themselves defend against attacks that purely flood network bandwidth. Until the server machine establishes that the incoming message has not provided the required proof of work for a particular source, messages use up network resources. Similarly, putative (but false) proofs of work use up resources until their deception is discovered. Lastly, these techniques only work on protocols involving session setup (not UDP services, for example).

Resource Allocation

Denial of service is essentially based on one or more attack machines seizing an unfair share of the resources of the target. One class of DDoS protection approaches, based on resource allocation (or fair resource sharing), seeks to prevent DoS attacks by assigning a fair share of resources to each client. Since the attacker needs to steal resources from the legitimate users to deny service, resource allocation defeats this goal.

A major challenge for resource allocation approaches is establishing the user's identity with confidence. If the attacker can fake his identity, he can exploit a resource allocation scheme to deny service. One attack method would be for the attacker to fake a legitimate user's identity, and take over this user's resources. Another attack method is to use IP spoofing to create a myriad of seemingly legitimate users. Since there are not enough resources to grant each user's request, some clients will have to be rejected. Because fake users are much more numerous than the legitimate ones, they are likely to grab more resource slots, denying the service.

The common approach for establishing the user's identity is to couple resource allocation with source validation schemes. Another approach is to combine resource allocation with a proof of work. Once the client submits the correct proof of work, the server is assured not only of the client's identity but also of his commitment to this communication. Resource allocation can then make sure that no client can monopolize the service.
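
As a simple illustration of enforcing per-client fair shares, the sketch below keeps one token bucket per validated client identity; the rate and burst values, and the idea of keying on an identity that has already passed source validation or a proof of work, are assumptions made for the example.

    import time
    from collections import defaultdict

    RATE = 10.0    # requests per second granted to each validated client
    BURST = 20.0   # short bursts allowed above the sustained rate

    class FairShare:
        def __init__(self):
            # One token bucket per validated client identity.
            self.tokens = defaultdict(lambda: BURST)
            self.last = defaultdict(time.monotonic)

        def allow(self, client_id):
            now = time.monotonic()
            elapsed = now - self.last[client_id]
            self.last[client_id] = now
            self.tokens[client_id] = min(BURST, self.tokens[client_id] + elapsed * RATE)
            if self.tokens[client_id] >= 1.0:
                self.tokens[client_id] -= 1.0
                return True    # within this client's fair share
            return False       # client is exceeding its allocation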

Bear in mind that the attacker can still perform a successful attack, in spite of a strong resource allocation scheme. Just as with proof-of-work or source validation approaches, a large number of attack agents can overload the system if they behave like the legitimate users. However, resource allocation significantly raises the bar for the attacker. He needs many more agents than before, each of which can send only at a limited rate, and they must abstain from IP spoofing to pass the identity test. This makes the game much more balanced for the defenders than before.

However, unless resource allocation schemes are enforced throughout the entire Internet, the attacker can still attempt to flood the point at which resource allocations are first checked. Most such schemes are located near the target, often at its firewall or close to its connection to the Internet. At that point, the function that determines the owner of each message and performs accounting can reject incoming messages that go beyond their owners' allocations, protecting downstream entities from flooding. But it cannot prevent itself from being flooded. A resource allocation defense point that can only handle 100 Mbps of incoming traffic can be overwhelmed by an attacker who sends 101 Mbps of traffic to it, even if he has not been allocated any downstream resources at all.

A further disadvantage of this approach is that it requires users to divulge their identities in verifiable ways, so that their resource usage can be properly accounted for. Many users are understandably reluctant to provide these kinds of identity assurances when not absolutely necessary. A DDoS solution that requires complete abandonment of all anonymous or pseudonymous interactions [DMS04] in the Internet has a serious downside. Some researchers are examining the use of temporary pseudonyms or other identity-obscuring techniques that might help solve this problem, but it is unclear if they would simultaneously prevent an attacker from obtaining as many of these pseudonyms as he needs to perpetrate his attack.

Hiding

None of the above approaches protect the server from bandwidth overload attacks that clog incoming links with random packets, creating congestion and pushing out the legitimate traffic. Hiding addresses this problem. Hiding obscures the server's or the service's location. As the attacker does not know how to access the server, he cannot attack it anymore. The server is usually hidden behind an "outer wall of guards." Client requests first hit this wall, and then clients are challenged to prove their legitimacy. Any source validation or proof-of-work approach can be used to validate the client. The legitimacy test has to be sufficiently reliable to weed out attack agent machines. It also has to be distributed, so that the agents cannot crash the outer-wall guards by sending too many service requests. Legitimate clients' requests are then relayed to the server via an overlay network. In some approaches, a validated client may be able to send his requests more directly, without going through the legitimacy test for every message or connection. In the extreme, trusted and preferred clients are given a permanent "passkey" that allows them to take a fast path to the service without ever providing further proof of legitimacy. There are clear risks to that extreme. An example hiding approach is SOS [KMR02], discussed in more detail in Chapter 7.

A poor man's hiding scheme actually prevented one DDoS attack. The Code Red worm carried, among its other cargo, code intended to perform a DDoS attack on the White House's Web site. However, the worm contained a hard-coded IP address for the victim's Web site. When the worm was captured and analyzed, this IP address was identified and the target was protected simply by changing its IP address. By sending out routing updates that caused packets sent to the old address to be dropped, the attack packets would not even be delivered to the White House's router. Had the worm instead used a DNS name to identify its victim, a DNS name resolution lookup would have occurred. This would mean both worms and legitimate clients would be directed to any new IP address (thus making a change of the DNS host name to IP address mapping an ineffective solution).[7] This approach is not generally going to help you against a reasonably intelligent DDoS attacker, but it illustrates the basic idea.

[7] On the other hand, if DDoS agents do DNS lookups for a particular victim IP address, they can be discovered by watching for these queries on local name servers. They may then be individually blocked at the site's border using filtering methods.

Hiding approaches show a definite promise, but incur high cost to set up the overlay network and distribute guard machines all over the Internet. Further, client software is likely to need modification for various legitimacy tests. All this extra cost makes hiding impractical for protection of public and widely accessed services, but well suited for protection of corporate or military servers. A major disadvantage of hiding schemes is that they rely on the secrecy of the protected server's IP address. If this secret is divulged, attackers can bypass the protection by sending packets directly to that address, and the scheme can become effective again only by changing the target's address.

Some hiding solutions have been altered to provide defense benefits even when the protected target's address is not a secret. More details can be found in Chapter 7, but, briefly, the target's router is configured to allow messages to be delivered to the target only if they originate from certain hosts in a special overlay network. Legitimate users must prove themselves to the overlay network, while attackers trying to work through that network are filtered out. Whether this scheme can provide effective protection is uncertain at this time. At the least, flooding attacks on the router near the target will be effective if they can overcome that router's incoming bandwidth.

Overprovisioning

Overprovisioning ensures that excess resources that can accommodate both the attack and the legitimate traffic are available, thus avoiding denial of service. Unlike previous approaches that deal with attack prevention, overprovisioning strengthens the victim to withstand the attack.

The most common approach is purchasing abundant incoming bandwidth and deploying a pool of servers behind a load balancer. The servers may share the load equally at all times, or they may be divided into the primary and backup servers, with backup machines being activated when primary ones cannot handle the load. Overprovisioning not only helps withstand DDoS attacks, but also accommodates spikes in the legitimate traffic due to sudden popularity of the service, so-called flash crowds. For more information on flash crowds and their similarity to DDoS attacks see [JKR02] and the discussion in Chapter 7.

Another approach is to purchase content distribution services from an organization that owns numerous Web and database servers located all over the Internet. Critical services are then replicated over these distributed servers. Client requests are redirected to the dedicated content distribution server, which sends them off to the closest or the least loaded server with the replicated service for processing. The content distribution service may dynamically increase its replication degree of a user's content if enough requests are generated, possibly keeping ahead of even rapidly increasing volumes of DDoS requests.

After the attack on the DNS root servers in October 2002, many networks operating these services set up extra mirror sites for their service at geographically distributed locations. For example, ISC, which runs the DNS root server designated as the F server, expanded its mirroring to 20 sites on five continents, as of this writing, with plans to expand even further. The fairly static nature of the information stored at DNS root servers makes them excellent candidates for this defense technique.

Overprovisioning is by far the most widely used approach for DDoS defense. It raises the bar for the attacker, who must generate a sufficiently strong attack to overwhelm abundant resources. However, overprovisioning does not work equally well for all services. For instance, content distribution is easily implemented for static Web pages, but can be quite tricky for pages with dynamic content or those that offer access to a centralized database. Further, the cost of overprovisioning may be prohibitive for small systems. If a system does not usually experience high traffic volume, it needs modest resources for daily business. Purchasing just a bit more will not help fend off many DDoS attacks, while purchasing a lot more resources is wasteful, as they rarely get used. Finally, while it is more difficult to perpetrate a successful attack against a well-provisioned network, it is not impossible. The attacker simply needs to collect more agents, possibly a trivial task with today's automated tools for malicious code propagation. With known attack networks numbering 400,000 or more, and some evidence suggesting the existence of million-node armies (see http://www.ladlass.com/archives/001938.html), one might question whether it is sufficient to prepare for DDoS attacks by overprovisioning.

5.6.2. Attack Detection

If protection approaches cannot make DDoS attacks impossible, then the defender must detect such attacks before he can respond to them. Even some of the protection approaches described above require attack detection. Certain protection schemes are rather expensive, and some researchers have suggested engaging them only when an attack is taking place, which implies the need for attack detection.

Two major goals of attack detection are accuracy and timeliness.

Accuracy is measured by how many detection errors are made. A detection method can err in two ways. It can falsely detect an attack in a situation when no attack was actually happening. This is called a false positive. If a system generates too many false positives, this may have dire consequences, as discussed in Section 5.3.2. The other way for a detection method to err is to miss an attack. This is called a false negative. While any detection method can occasionally be beaten by an industrious and persistent attacker, frequent false negatives signify an incomplete and faulty detection approach.

As the attack detection drives the engagement of the response, the performance of the whole DDoS defense system depends on the timeliness of the detection. Attacks that are detected and handled early may even be transparent to ordinary customers and cause no unpleasant disruptions. Detection after the attack has inflicted damage to the victim fails to prevent interruptions, but minimizes their duration by quickly engaging an appropriate response.

The difficulty of attack detection depends to a great extent on the deployment location and the desired detection speed. Detecting an attack at the victim site is trivial after the DoS effect becomes pronounced. It is like detecting that the dam has broken once your house is underwater. Usually, the network is either swamped by a sudden traffic flood or some of its key servers are slow or have crashed. This situation is so far from the desired that the crudest monitoring techniques can spot it and raise an alert. However, denial of service takes a toll on network resources and repels customers. Even if the response is promptly engaged, the disruption is bad for business. It is therefore desirable to detect an attack as early as possible, and respond to it, preventing the DoS effect and maintaining a good face to your customers. Although agent machines are usually synchronized by commands from a central authority and engaged all at once, the attack traffic will take some time (several seconds to a few minutes) to build up and consume the victim's resources. This is the window where early detection must operate. What is desired is to detect that water is seeping through the dam and to evacuate the houses downstream minutes before the dam breaks.

The sensitivity and accuracy of attack detection deteriorate as monitoring is placed farther away from the victim. This is mostly due to incomplete observations, as monitoring techniques at the Internet core or close to attack sources cannot see all traffic that a victim receives, and cannot closely observe the victim's behavior to spot problems. This is like trying to guess whether a dam will break by checking for leaks at a single spot in the dam. It may happen that the other places leak profusely while the one we are monitoring is dry and safe. It also may happen that all observed places leak very little and seem innocuous, but the total amount of water leaked is enough to flood the houses downstream.

Core-based detection techniques must be very crude, as core router resources are limited. This further decreases the accuracy. On the other hand, source-based detection techniques can be quite complex. Fortunately, since sources see only moderate traffic volumes even during the attack, they can afford to engage in extensive statistics gathering and sophisticated profiling.

Since target-based detection is clearly superior to core- and source-based attempts, why do we have detection techniques located away from the target? The reason lies in the fact that autonomous DDoS defense is far simpler and easier to secure than a distributed defense. DDoS response near the source is most effective and incurs the least collateral damage, and co-locating a detection module with the response builds an autonomous defense at the spot. Similarly, core-based response has the best yield, since a core deployment at a few response points can control a vast number of attack streams, irrespective of the source and victim locations. Adding a detection mechanism to core-based response builds autonomous and stable defense in the Internet core. Balancing the advantages and disadvantages of various detection locations is another complex task for defenders.

Once the attack has been successfully detected, the next crucial task is attack characterization. The detection module must be able to precisely describe the offending traffic, so that it can be sifted from the rest by the response module. Legitimate and attack traffic models used in detection, sometimes coupled with additional statistics and profiling, guide the attack characterization. The goal is to obtain a list of parameters from the packet header and contents, along with a range of values that indicate a legitimate or an attack packet. Each incoming packet is then matched against the list, and the response is selectively applied to packets deemed to be a likely part of an attack. Attack characterization is severely hindered by the fact that the attack and legitimate traffic look alike. However, good attack characterization is of immense importance to DDoS defense, as it determines the amount of collateral damage and the effectiveness of the response.

Three main approaches to attack detection are signature, anomaly, and misbehavior detection.

Signature Detection

Signature detection builds a database of attack characteristics observed in past incidents, known as attack signatures. All incoming packets are compared against this database, and those that match are filtered out. Consequently, the signature must be carefully crafted to precisely specify the attack, but also to ensure that no legitimate traffic generates a match. The goal is to reach a zero false-positive rate, but the effectiveness of signature detection is limited to those attacks that involve easy-to-match packet attributes. For example, the Land DoS attack [CER98a] sends packets whose source IP address and source port are the same as their destination IP address and port, causing some TCP/IP implementations to crash. As no legitimate application will ever need to send a similarly crafted packet, a check for equality of source and destination IP address and port can form a valid attack signature. This kind of check is so simple that it should always be performed. Other signatures can be much more complex.
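
A toy sketch of such a signature check follows; the packet is represented as a simple dictionary of decoded header fields, purely for illustration.

    SIGNATURES = {
        # Land attack: source and destination address and port are identical,
        # something no legitimate application ever produces.
        "land": lambda p: p["src_ip"] == p["dst_ip"] and p["src_port"] == p["dst_port"],
    }

    def matching_signatures(pkt):
        return [name for name, test in SIGNATURES.items() if test(pkt)]

    def should_forward(pkt):
        # Packets matching any known attack signature are filtered out.
        return not matching_signatures(pkt)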

Since vulnerability attacks can be successful with very few packets, signature detection that accurately pinpoints these packets (and helps filtering mechanisms to surgically remove them from the input stream) is an effective solution. On the other hand, signature detection cannot help with flooding DDoS attacks that generate random packets similar to legitimate traffic.

In addition to victim-end deployment, signature detection can be used at the source networks to identify the presence of agent machines. One approach is to monitor control traffic between agent machines and their handlers to look for telltale signs of DDoS commands. Most DDoS tools will format their control messages in a specific manner, or will embed some string in the messages. This can be used as a signature to single out DDoS control traffic. For example, one of the popular DDoS tools, TFN2K [CERb], pads all control packets with a specific sequence of ones and zeros. Modern DDoS tools use encrypted channels for control messages or use polymorphic techniques, both of which defeat signature-based detection of control traffic.

Another approach is to look for listening network ports used for control. Some DDoS tools using the handler/agent model require agents to actively listen on a specific port. While this open port can easily be changed by an attacker, there are a handful of widely popular DDoS tools that are usually deployed without modification. Hence, a tool-specific port can frequently make a good signature for agent detection. Detecting open ports requires port scanning suspected agent machines. Most of the modern DDoS tools evade port-based detection through use of IRC channels (sometimes encrypted) for control traffic. All agents and the attacker join a specific channel to send and receive messages. While the mere use of IRC does not provide a signal that a machine is involved in a DDoS attack, if the DDoS agents use cleartext messages on the channel (as many actually do), signature detection can be performed by examining the messages sent over IRC channels. If the use of IRC is prohibited on your machines (making the presence of IRC traffic a clear signal of problems), the attacker can instead embed commands in HTTP traffic or other forms of traffic that your network must permit.

A more sophisticated detection approach is to monitor flows to and from hosts on the network and to detect when a host that formerly acted only as a client (i.e., establishing outbound connections to servers) suddenly starts acting like a server and receiving inbound connections. Similarly, you can check if a Web server that has only received incoming connections to HTTP and HTTPS service ports suddenly behaves like an IRC server or a DNS server. Some of these techniques step across the boundaries of signature detection into the realm of anomaly detection, discussed later in this section. Stepping stones may also be detected using these techniques (by correlating inbound and outbound flows of roughly equal amounts). Note that some attacker toolkits do things like embed commands in other protocols (e.g., using ICMP to tunnel commands and replies), or may use TCP as a datagram protocol, fooling some defense tools into thinking that the fact that there was never an established TCP connection implies that no communication is occurring.
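
A bare-bones sketch of the role-change heuristic is shown below; it tracks only which hosts have been seen initiating and accepting TCP connections, ignoring ports, timing, and traffic volumes that a real monitor would also consider.

    from collections import defaultdict

    # For each local host, record whether it has been seen initiating
    # connections ("client") and/or accepting them ("server").
    roles = defaultdict(set)

    def observe_syn(initiator, responder, dst_port):
        alerts = []
        if "client" in roles[responder] and "server" not in roles[responder]:
            # A formerly client-only host suddenly accepting connections may
            # be running newly installed agent or handler software.
            alerts.append("%s now accepting connections on port %d" % (responder, dst_port))
        roles[initiator].add("client")
        roles[responder].add("server")
        return alerts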

Finally, it is possible to detect agents by examining each machine, looking for specific file names, contents, and locations. All popular and widely used DDoS tools have been carefully dissected and the detailed description of the tool-specific ports, control traffic features, and file signatures can be found at the CERT Coordination Center Web page [CERe] or at Dave Dittrich's DDoS Web page [Ditd]. Of course, one cannot look for file details on machines one does not own and control. Also, attackers may try to avoid detection by installing a rootkit at the subverted machine to hide the presence of malicious files and open ports.

Intrusion Detection Systems (IDSs) can also be used to detect compromises of potential agent machines. They examine the incoming traffic looking for known compromise patterns and drop the suspicious packets. In addition to preventing subversion for DDoS misuse, they protect the network from general intruders and promote security [ACF+99]. One major drawback of simple IDS solutions is that they often have a high alert rate, especially false-positive alerts. Newer IDSs employ combinations of operating system detection and service detection, correlating them with attack signatures to weed out obvious false alarms, such as a Solaris/SPARC-based attack against a DNS server that is directed at an Intel/Windows XP desktop that never had a DNS server in the first place.

Anomaly Detection

Anomaly detection takes the opposite approach from signature detection. It acknowledges the fact that malicious behaviors evolve and that a defense system cannot predict and model all of them. Instead, anomaly detection strives to model legitimate traffic and raise an alert if observed traffic violates the model. The obvious advantage of this approach is that previously unknown attacks can be discovered if they differ sufficiently from the legitimate traffic. However, anomaly detection faces a huge challenge. Legitimate traffic is diverse: new applications arise every day and traffic patterns change. A model that specifies legitimate traffic too tightly will generate a lot of false positives whenever traffic fluctuates. On the other hand, a loose model will let a lot of attacks go undetected, thus increasing the possibility of false negatives. Finding the right set of features and a modeling approach that strikes a balance between false positives and false negatives is a real challenge.

Flow monitoring with correlation, described in a previous section, is another form of anomaly detection, which also combines features of behavioral models.

Behavioral Models Behavioral models select a set of network parameters and learn the proper value ranges of these parameters by observing network traffic over a long interval. They then use this baseline model to evaluate current observations for anomalies. If some parameter in the observed traffic falls out of the baseline range by more than a set threshold, an attack alert is raised. The accuracy and sensitivity of a behavioral model depend on the choice of parameters and the threshold value. The usual approach is to monitor a vast number of parameters, tuning the sensitivity (and the false-positive rate) by changing threshold values. To capture the variability of traffic on a daily basis (for instance, traffic on weekends in the corporate network will have a different behavior than weekday traffic), some detection methods model the traffic with a time granularity of one day.
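
A minimal sketch of such a baseline model appears below, using a single parameter (say, incoming megabits per second) and a crude standard-deviation threshold; the training values and the threshold of three standard deviations are illustrative only.

    import statistics

    class BaselineModel:
        def __init__(self, history, threshold=3.0):
            # Learn the normal range of one traffic parameter from attack-free
            # observations gathered over a long interval.
            self.mean = statistics.mean(history)
            self.stdev = statistics.pstdev(history) or 1.0
            self.threshold = threshold    # tolerated deviation, in standard deviations

        def is_anomalous(self, observation):
            return abs(observation - self.mean) > self.threshold * self.stdev

    # Example: a baseline learned from quiet periods, then a sudden surge.
    model = BaselineModel([28, 31, 30, 29, 33, 27, 30])
    print(model.is_anomalous(32))    # False: within the normal range
    print(model.is_anomalous(200))   # True: raise an attack alert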

Behavioral models show definite promise for DDoS detection, but they face two major challenges:

  1. Model update. As network and traffic patterns evolve over time, models need to be updated to reflect this change. A straightforward approach to model update is to use observations from the past intervals when no attack was detected. However, this creates an opportunity for the attacker to mistrain the system by a slow attack. For instance, suppose that the system uses a very simple legitimate traffic model, recording just the incoming traffic rate. By sending the attack traffic just below the threshold for a long time, the attacker can lead the system to believe that conditions have changed and increase the baseline value. Repeating this behavior, the attacker will ultimately overwhelm the system without raising the alert.

    While these kinds of training attacks are rare in the wild, they are quite possible and easy to perpetrate. A simple fix is to sample the observations at random times and derive model updates from these samples. Another possible fix is to have a human review the updates before they are installed.

  2. Attack characterization. Since the behavioral models generate a detection signal that simply means "something strange is going on," another set of techniques is necessary for traffic separation. One possible and frequently used approach is to profile incoming packets looking for a set of features that single out the majority of packets. For instance, assume that our network is suddenly swamped by traffic, receiving 200 Mbps instead of the usual 30 Mbps. Through careful observation we have concluded that 180 Mbps of this traffic is UDP traffic, carrying DNS responses. Using UDP/DNS-response characterization to guide filtering, we will get rid of the flood, but likely lose some legitimate DNS responses in the process. This is the inherent problem of behavioral models, but it can be ameliorated to a great extent by a smart choice of the feature set for traffic separation. Another possible approach is to create a list of legitimate clients' source addresses, either based on past behavior or through some offline mechanism. This approach will let some attack traffic through when the attacker spoofs an address from the list.
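
The DNS-response example above can be mimicked with a very small profiling sketch: group the observed traffic by a candidate feature set (here, protocol and source port) and report the combination carrying most of the volume. The dictionary-based packet representation is, again, just for illustration.

    from collections import Counter

    def characterize(packets):
        volume = Counter()
        for p in packets:
            # Candidate feature set: protocol and source port,
            # e.g., ("UDP", 53) corresponds to DNS responses.
            volume[(p["proto"], p["src_port"])] += p["length"]
        feature, top_bytes = volume.most_common(1)[0]
        share = top_bytes / sum(volume.values())
        return feature, share

    # If ("UDP", 53) turns out to carry, say, 90% of a sudden flood, filtering
    # or rate limiting UDP source port 53 removes most of the attack, at the
    # cost of losing some legitimate DNS responses.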

Standard-Based Models Standard-based models use standard specifications of protocol and application traffic to build legitimate models. For example, the TCP protocol specification describes a three-way handshake that has to be performed for TCP connection setup. An attack detection mechanism can use this specification to build a model that detects half-open TCP connections or singles out TCP data traffic that does not belong to an established connection. If protocol and application implementations follow the specification, standard-based models will generate no false positives. Not all protocol and application implementations do so, however, as was pointed out by Ptacek and Newsham [PN98].
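
A sketch of a standard-based check for half-open TCP connections follows; it keeps a per-connection state machine driven only by the SYN and ACK flags, and it ignores direction subtleties, retransmissions, and timeouts that a real implementation must handle.

    connections = {}    # (src_ip, src_port, dst_ip, dst_port) -> connection state

    def on_packet(key, flags):
        state = connections.get(key)
        if "SYN" in flags and "ACK" not in flags:
            connections[key] = "half-open"             # handshake started
        elif "ACK" in flags and "SYN" not in flags and state == "half-open":
            connections[key] = "established"           # handshake completed
        elif state is None and "SYN" not in flags:
            return "data outside an established connection"   # violates the standard
        return None

    def half_open_count():
        # A large and growing number of half-open connections indicates a SYN flood.
        return sum(1 for s in connections.values() if s == "half-open")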

The other drawback of the standard-based models is their granularity. Since they model protocol and application traffic, they have to work at a connection granularity. This potentially means a lot of observation gathering and storage, and may tax system performance when the attacker generates spoofed traffic (thus creating many connections). Standard-based models must therefore deploy sophisticated techniques for statistics gathering and periodic cleanup to maintain good performance.

While standard-based models protect only from those attacks that clearly violate the standard, they guarantee a low false-positive rate and need very little maintenance for model update, except when a new standard is specified. The models can effectively be used for traffic separation by communicating the list of misbehaving connections to the response system.

Misbehavior Modeling

Instead of trying to model normal behavior and match ongoing behavior to those models, one can model misbehavior and watch for its occurrence. The simple method of detecting DDoS attacks at the target is misbehavior modeling at its most basic: The machine is receiving a vast amount of traffic and is not capable of keeping up. Yep, that's a DDoS attack. At one extreme, misbehavior modeling is the same as signature-based detection: Receiving a sufficiently large number of a particular type of packet on a particular port with a particular pattern of source addresses may be both a misbehavior model and a signature of the use of a particular attack toolkit. But misbehavior modeling can be defined in far more generic terms that would not be recognized as normal signatures. At the other extreme, misbehavior modeling is no different than anomaly modeling: If it is not normal, it is DDoS. But misbehavior modeling, by trying to capture the characteristics of only DDoS attacks, characterizes all other types of traffic, whether they have actually been observed in the past or not, as legitimate. True misbehavior modeling falls in the range between these extremes.

The challenge in misbehavior modeling is finding characteristics of traffic that are nearly sure signs that a DDoS attack is going on, beyond the service actually failing under high load. Perhaps a sufficiently large ramp-up in traffic over a very short period of time could signal a DDoS attack before the machine was actually overwhelmed, but perhaps it signals only a surge in interest in the site or a burst of traffic that was delayed somewhere else in the network and has suddenly been delivered in bulk. Perhaps a very large number of different addresses sending traffic in a very short period of time signals an attack, but perhaps it only means sudden widespread success of your Web site. It is unclear if it is possible to model DDoS behavior sufficiently well to capture it early without falsely capturing much legitimate behavior. (Such mischaracterization could be either harmless or disastrous, depending on what you do when a DDoS attack is signaled.)

5.6.3. Attack Response

The goal of attack response is to improve the situation for legitimate users and mitigate the DoS effect. There are three major ways in which this is done:

  1. Traffic policing. The most straightforward and desirable response to a DoS attack is to drop offending traffic. This makes the attack transparent both to the victim and to its legitimate clients, as if it were not happening. Since attack detection and characterization are sometimes inaccurate, the main challenge of traffic policing is deciding what to drop and how much to drop.

  2. Attack traceback. Attack traceback has two primary purposes: to identify agents that are performing the DDoS attack, and to try to get even further back and identify the human attacker who is controlling the DDoS network. The first goal might be achievable, but is problematic when tens of thousands of agents are attacking. The latter is nearly impossible today, due to the use of stepping stones. These factors represent a major challenge to traceback techniques. Compounding the problem is the inability of law enforcement to deal with the tens, or hundreds of thousands, of compromised hosts scattered across the Internet, which also means scattered across the globe. Effective traceback solutions probably need to include components that automatically police traffic from offending machines, once they are found. See Chapter 7 for detailed discussion of traceback techniques.

  3. Service differentiation. Many protection techniques can be turned on dynamically, once the attack is detected, to provide differentiated service. Clients are presented with a task to prove their legitimacy, and those that do receive better service. This approach offers a good economic model. The server is generally willing to serve all the requests. At times of overload, the server preserves its resources and selectively serves only the VIP clients (those who are willing to prove their legitimacy) and provides best-effort service to the rest. A challenge to this approach is to handle attacks that generate a large volume of bogus legitimacy proofs. It may be necessary to distribute the legitimacy verification service to avoid the overload.

As each response has its own set of limitations, it is difficult to compare them to each other. Service differentiation creates an opportunity for the legitimate users to actively participate in DDoS defense and prove their legitimacy. This is the most fair to the customers, as they control the level of service they receive, not relying on (possibly faulty) attack characterization at the victim. On the other hand, service differentiation requires changes in the client software, which may be impractical for highly popular public services. Traceback requires a lot of deployment points in the core, but places the bulk of complexity at the victim and enables response long after the attack has ended. Traffic policing is by far the most practical response, as its minimum number of deployment points is one in the vicinity of the victim. However, traffic policing relies on sometimes inaccurate attack characterization and is bound to inflict collateral damage.

Finally, there is no need to select a single response approach. Traceback and traffic policing can be combined to drop offending traffic close to its sources. Traffic policing can work with service differentiation, offering different policies for different traffic classes. Traceback can bring service differentiation points close to the sources, distributing and reducing the server load.

Traffic Policing

Two main approaches in traffic policing are filtering and rate limiting. Filtering drops all the packets indicated as suspicious by the attack characterization, while rate limiting enforces a rate limit on all suspicious packets. The choice between these two techniques depends on the accuracy of attack characterization. If the accuracy is high, dropping the offending traffic is justified and will inflict no collateral damage. When the accuracy is low, rate limiting is definitely a better choice, as some legitimate packets that otherwise would have been dropped are allowed to proceed to the victim. This will reduce collateral damage and facilitate prompt recovery of legitimate traffic in the case of false positives. Signature detection techniques commonly invoke a filtering response, as the offending traffic can be precisely described, while anomaly detection is commonly coupled with rate limiting as a less restrictive response.
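
The sketch below shows the two responses sharing a single characterization result; the probabilistic pass-through stands in for a real rate limiter (such as a token bucket), and the 10% figure is an arbitrary illustration.

    import random

    def police(pkt, is_suspicious, characterization_is_accurate, pass_fraction=0.1):
        if not is_suspicious(pkt):
            return "forward"
        if characterization_is_accurate:
            return "drop"        # filtering: discard everything flagged as attack traffic
        # Rate limiting: let a fraction of flagged traffic through, so that
        # mischaracterized legitimate packets still have a chance to reach the victim.
        return "forward" if random.random() < pass_fraction else "drop"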

The main challenge of traffic policing is to minimize legitimate traffic drops, one form of collateral damage. There are two sources of inaccuracy that lead to this kind of collateral damage: incorrect attack characterization and false positives. If the attack characterization cannot precisely separate the legitimate from the attack traffic, some legitimate packets will be dropped every time the response is invoked. The greater the inaccuracy, the greater the collateral damage. False positives needlessly trigger the response. The amount of the collateral damage again depends on the characterization accuracy, but false alarms may mislead the characterization process and thus increase legitimate drops.

How bad is it to drop a few legitimate packets? At first glance, we might conclude that a small rate of legitimate drops is not problematic, as the overwhelming majority of Internet communication is conducted using TCP. Since TCP is a reliable transmission protocol, dropped packets will be detected and retransmitted shortly after they were lost, and put in order at the receiving host. The packet loss and the remedy process should be transparent to the application and the end user. This works very well when there are only a few drops, once in a while. The mechanisms ensuring reliable delivery in the TCP protocol successfully mask isolated packet losses. However, TCP performance drops drastically in the case of sustained packet loss, even if the loss rate is small. The reason for this lies in the TCP congestion control mechanism, which detects packet loss as an early sign of congestion. TCP's congestion control module responds by drastically reducing the sending rate in the effort to alleviate the pressure at the bottleneck link. The rate is reduced exponentially with each loss and increased linearly in the absence of losses. Several closely spaced packet drops can thus quickly reduce the connection rate to one packet per sending interval. After this point, each loss of the retransmitted packet exponentially increases the sending interval. Overall, sustained packet loss makes the connection send less and with a reduced frequency.

While very effective in alleviating congestion, this response severely decreases the competitiveness of legitimate TCP traffic in case of a DoS attack. In the fight for a shared resource, more aggressive traffic has a better chance to win. The attack traffic rate is usually unrelenting, regardless of the drops, while the legitimate traffic quickly decreases to a trickle, thus forfeiting its fighting chance to get through. Rate limiting for DDoS response introduces another source of drops in addition to congestion, trying to tip the scale in favor of the legitimate traffic. If the rate limiting is not sufficiently selective, packet drops due to collateral damage will have the same ill effect on the legitimate connection as congestion drops did. Even if the congestion is completely resolved (the response has successfully removed the attack traffic), those legitimate connections that had severe drops will take a long time to recover and may be aborted by the application. It is therefore imperative to eliminate as many legitimate drops as possible, not only by making sure that the response is promptly engaged, but also by increasing its selectiveness.

The traffic-policing component can be placed anywhere on the attack path. Placing the response close to the victim ensures the policing of all attack streams with a single response node, but may place a substantial burden on the DDoS defense system when the victim is subjected to a high-volume flood. Victim-end deployment also maximizes the chances for collateral damage, if rate limiting is the response of choice, as imperfect drop decisions affect all traffic reaching the victim. Better performance can be achieved by identifying those network paths that likely carry the attack traffic and pushing the rate limit along those paths as close to the sources as possible. This localizes the effect of erroneous drops to only those legitimate clients who share a path to the victim with an attacker. Unfortunately, this approach increases the number of response points needed to completely control the attack, as a response node must be installed on each identified path.
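
One way to automate this placement is a pushback-style propagation of rate limits, in which a congested response node asks the upstream links carrying the bulk of the suspect aggregate to enforce a proportional share of the overall limit. The sketch below shows only the proportional split; the topology, traffic figures, and cutoff fraction are invented for the example and do not come from any deployed system.

def push_rate_limit(aggregate_rate, limit, upstream_rates, min_share=0.05):
    # upstream_rates: {link_name: observed rate of the suspect aggregate on that link}
    plan = {}
    for link, rate in upstream_rates.items():
        share = rate / aggregate_rate
        if share >= min_share:                 # only ask links carrying a meaningful share
            plan[link] = limit * share         # proportional slice of the overall limit
    return plan

# Example: a victim-side router sees 900 Mb/s of suspect traffic but can accept only
# 100 Mb/s; three upstream links carry unequal shares of the aggregate.
print(push_rate_limit(900, 100,
                      {"upstream-A": 600, "upstream-B": 270, "upstream-C": 30}))
# -> roughly {'upstream-A': 66.7, 'upstream-B': 30.0}; link C falls below the cutoff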

One technique currently used to counter large, long-lasting attacks is to start by filtering locally. If that is not sufficient, the victim then contacts his upstream network provider to request the installation of filters there. In principle, this manual pushing of filters back into the network could continue indefinitely, but since each step requires human contact and intervention, it is rarely carried very far into the network. One example of successful use of this technique occurred during the DDoS attack on the DNS root servers. One root server administrator contacted his backbone provider to install filters to drop certain types of packets in the attack stream, thus reducing the attack traffic on the link leading to his root server. The manual approach has some strong limitations, however. One must carefully characterize the packets to avoid collateral damage, and not all network providers will respond quickly to all customers' requests to install filters. This issue is discussed in more detail in Chapter 6.

Attack Traceback

Attack traceback has two primary purposes: to identify (and possibly shut down) agents that are implementing the actual DDoS attack, and to try to get even further back and identify the human attacker who is controlling the DDoS network. Traceback would thus be extremely helpful not only in DDoS defense, but also in cases of intrusions and worm infections, where the attack is inconspicuous, contained in a few packets, and may be detected long after it ends. Traceback techniques enable the victim to reassemble the path of the attack with the help of core routers. In packet-marking techniques [SPS+01, DFS01, SWKA00], routers tag packets with extra information stating, "The packet has passed through this router." In ICMP traceback [BLT01], additional control information is sent randomly to the victim, indicating that packets have passed through a given router. The victim uses all such information it receives to deduce the paths taken by attack packets. In hash-based traceback [SPS+01], routers remember each packet they have seen for a short time and can retrieve this knowledge in response to a victim's queries. Obviously, all these approaches place a burden on the intermediate routers: they must generate additional traffic, rewrite a portion of the traffic they forward, or dedicate significant storage to keep records of packets they have seen. More overhead is incurred by the victim when it tries to reassemble the attack path; this process may be very computationally intensive and lead to additional control traffic between the victim and the core routers. As the attack becomes more distributed, the cost of traceback increases.
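
To make the packet-marking idea concrete, the following sketch simulates the simplest variant, node sampling in the spirit of [SWKA00]: each router overwrites a single mark field with its own identity with some probability, and the victim infers the order of routers on the path from how often each identity survives to reach it. The path, marking probability, and packet count below are illustrative assumptions.

import random
from collections import Counter

def mark_along_path(path, p=0.2):
    # Simulate one packet traversing `path` (attacker side first); return the surviving mark.
    mark = None
    for router in path:              # routers nearer the victim mark later, so their
        if random.random() < p:      # marks are overwritten less often
            mark = router
    return mark

path = ["R1", "R2", "R3", "R4"]      # R4 is adjacent to the victim
marks = Counter(m for m in (mark_along_path(path) for _ in range(100_000)) if m)
for router, count in marks.most_common():
    print(router, count)             # counts decrease with distance from the victim

Because marks from distant routers are more likely to be overwritten, enough attack packets let the victim rank routers by distance and reconstruct the path, at the cost of per-packet work in the routers and reconstruction work at the victim.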

Another drawback is limited tracing precision. It is impossible to identify the actual subverted machine; rather, several networks in the vicinity of the attacking machine are identified. In a sparse deployment of traceback support at core routers, the number of suspect networks is likely to be very high. While this information is still beneficial if, for instance, we want to push a traffic-policing response closer to the sources, it offers little assistance to law enforcement authorities or to filtering rule generation.

An open issue is what action to take when tracing is completed. An automatic response, such as filtering or rate limiting, is the best choice, as the number of suspect sites is likely to be too large for human intervention. In this case, suspected networks that are actually innocent (i.e., networks that do not host agents but share a path with a network that does) will have their packets dropped. This is hardly fair. Another point worth mentioning is that even a perfect tracing approach up to the sending machine is useless in a reflector DoS attack. In this case, the machine sending problematic traffic is simply a public server that responds to seemingly legitimate queries. Since such servers will not themselves spoof their IP address, identification of them is trivial and no tracing is needed.

As noted in Chapter 4, even a workable traceback scheme has two other significant problems. First, traceback of DDoS flood traffic gets you only to the agents, not all the way back to the actual attacker (through all her handlers, IRC proxies, or login stepping stones). This may offer some opportunity to relieve the immediate attack but does not necessarily help catch the actual attacker or prevent her from making future attacks on you. Second, if a successful attack can be waged using only a few hundred or even a few thousand hosts while the attacker has access to 400,000 hosts, she can simply cycle through attack networks and force the victim to repeat the traceback and flood mitigation steps. Because these actions occur on human timescales today, the attacker would consume not only computer resources of the victim, but also human resources. Even at future automated speeds, the difficulties and costs of dealing with this sort of cycling attack could be serious. Having some understanding of how a particular attack is being waged would help the victim to know when such a tactic is in use and to adjust its response accordingly.

Service Differentiation

As mentioned above, some of the protection approaches can be engaged dynamically, when an attack is detected, to provide differentiated service to those clients who can prove their legitimacy. A dynamic deployment strategy has an advantage over static deployment, as operational costs are paid only when needed. There is an additional advantage in cases when the protection approach requires software changes at the client side. Were such approaches engaged statically, the server would lose all of its legacy clients. With dynamic engagement, legacy clients are impacted only when the attack is detected, and even then the effect is degradation of their service, since the protection mechanism favors those clients that deploy software changes. As the attack subsides, old service levels are restored.

Source validation approaches can be used to differentiate between preferred and ordinary clients, and to offer better service to the preferred ones during the attack. Proof-of-work approaches can be engaged to challenge users to prove their legitimacy, and resources can be dedicated exclusively to users whose legitimacy has been proven during the attack.
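
A common instantiation of proof of work is a hash-based client puzzle. The sketch below is illustrative only (the hash construction, difficulty, and interface are assumptions, not a specific system from this book): the server issues a cheap random challenge, the client must spend roughly 2^bits hash computations to solve it, and the server verifies the answer with a single hash.

import hashlib, os

def issue_puzzle(bits=20):
    # Per-client challenge; `bits` sets the expected client effort (~2^bits hashes).
    return os.urandom(16), bits

def solve(nonce, bits):
    target = 1 << (256 - bits)
    counter = 0
    while True:
        digest = hashlib.sha256(nonce + counter.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:   # enough leading zero bits
            return counter
        counter += 1

def verify(nonce, bits, counter):
    digest = hashlib.sha256(nonce + counter.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))

nonce, bits = issue_puzzle()
solution = solve(nonce, bits)          # expensive for the client
assert verify(nonce, bits, solution)   # cheap for the server (one hash)

During an attack, the server can reserve its resources for requests accompanied by a valid puzzle solution, raising the per-request cost for agents while imposing only a modest, temporary burden on legitimate clients.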


