11.1. Incident Response | Practical Guide to Software Quality Management (Artech House Computing Library)


A security incident can be reported in a number of ways. An IDS may detect a compromised system. A user may report that she has had files deleted or modified. And sometimes systems administrators will simply have a "bad feeling" about a host and ultimately realize it was the victim of an attack. Regardless of how an incident is detected, the key to dealing with it is to be prepared and to understand what your next steps are. Before an incident occurs, you need to prepare your tools, your process, and your coworkers for what needs to occur.

Incident response and forensics can be a detailed and difficult operation. For many, the activities and details discussed in this chapter will be sufficient. However, for those in more sensitive industries such as healthcare or finance, this chapter should serve only as a primer. Those who want a more rigorous treatment of incident response should consult the books and articles listed in Section 11.5 at the end of this chapter.

A good starting point is identifying your incident response process. This process, like the security-minded administration process, can be thought of in a lifecycle model. Different organizations have created a variety of incident response lifecycles over the years. Some are waterfalls while others are circles; some have seven states while others have four. The lifecycle presented in this chapter and shown in Figure 11-1 is a notional lifecycle. Think of it as a summary of other models; we use it for illustrative purposes. If, through your own research, you discover other lifecycle models that resonate better with you, by all means use them instead.

Figure 11-1. Incident response lifecycle

11.1.1. Preparation

Hopefully, you don't find yourself in the throes of responding to security incidents on a continuous basis. If you do, you should probably revisit your core security mechanisms. Most of the time, you should be in the preparation phase. Here, you update your forensics tools, train staff, and identify internal and external resources that play a part in the incident response process. The Boy Scout motto of "Be Prepared" is a good mantra for incident response.

11.1.1.1 Identifying resources

You must identify resources that will play a role in incident response. Potentially one of the most daunting resources to identify is simply the assets on your network. Keeping a current list of all your hosts, the operating system version each one is running, and its physical location can be a full-time job depending on the size of your network. Having an up-to-date list of all your servers is vital because you never know when an incident will occur. For instance, if the physical location of a host has changed but has not been updated in your asset inventory, it can cause problems with your response. Running into a data center and unplugging the wrong host is at the very least embarrassing and probably makes a bad situation even worse for your organization. Even if your inventory is up to date, you should also consider labeling the front and back of each server and router to prevent confusion in the heat of an incident.
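
As a rough illustration of keeping such an inventory current, the following Python sketch records the three items mentioned above (host, operating system version, physical location) and flags entries that have not been re-checked recently. The `last_verified` field, the 90-day window, and the sample hosts are assumptions made for this example, not part of any particular inventory tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Asset:
    hostname: str
    os_version: str
    location: str        # e.g. data center, rack, and slot
    last_verified: date  # hypothetical field: when someone last confirmed this entry


def stale_assets(inventory, today, max_age_days=90):
    """Return assets whose entry has not been verified within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [asset for asset in inventory if asset.last_verified < cutoff]


# Made-up sample data for illustration only.
inventory = [
    Asset("web1", "FreeBSD 5.3", "DC1 rack 4 slot 2", date(2005, 1, 10)),
    Asset("db1", "OpenBSD 3.6", "DC1 rack 7 slot 1", date(2004, 6, 1)),
]

for asset in stale_assets(inventory, today=date(2005, 2, 1)):
    print(f"re-verify {asset.hostname} at {asset.location}")
```

With the sample data, only `db1` is reported, because its entry is older than the 90-day window. A periodic walk through the stale list is one way to catch a moved host before an incident does.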

Other important resources are the people in your organization. You should maintain a current contact list so that you have easy access to important individuals when they are needed. Further, you should create an escalation procedure to be used during an incident. Most IT resources in a company have two managers who need to be involved: the business manager who wants the service to be online and the technical manager who actually manages the IT resource. Both of these management trees need to be identified and included in the escalation procedures. Also, you may wish to set time thresholds for upper management notification; for example, if an incident keeps a host offline for 10 hours, then higher-ups in the company are notified. This recognizes that an incident that results in an extended service outage causes more damage to the business the longer it goes on.
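
A time-based escalation procedure can be written down as a simple lookup. The sketch below is only an illustration: the 10-hour upper-management threshold comes from the example above, while the other tiers, the role names, and the function name are placeholders you would replace with your own procedure.

```python
from datetime import datetime, timedelta

# Hypothetical escalation ladder: (time offline, who to notify).
# Only the 10-hour tier is taken from the text; the rest are placeholders.
ESCALATION_LADDER = [
    (timedelta(hours=0), "on-call administrator"),
    (timedelta(hours=2), "technical and business managers"),
    (timedelta(hours=10), "upper management"),
]


def contacts_to_notify(outage_start, now):
    """Return every tier whose time threshold the outage has reached."""
    elapsed = now - outage_start
    return [who for threshold, who in ESCALATION_LADDER if elapsed >= threshold]


start = datetime(2005, 1, 1, 8, 0)
print(contacts_to_notify(start, datetime(2005, 1, 1, 11, 0)))
print(contacts_to_notify(start, datetime(2005, 1, 1, 18, 0)))
```

Three hours into the outage the first two tiers are listed; at ten hours all three are. Keeping the ladder as data rather than logic makes it easy to review with the business owners who appear in it.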

11.1.1.2 Training staff

Incident response is not a one-person operation. Your technical staff should be trained in how to respond to an incident and their roles in the lifecycle should be clear. While the training does not need to be formal, it should be enough to educate the technical staff in the first steps they should take when they think they have found a compromised host. Also, any business owners who have been identified in the escalation procedure should be briefed as to expectations and actions they should take during an incident.

11.1.1.3 Creation of document templates

There may be documents you need to fill out during an incident. The document you are most likely to deal with is a chain of custody. The chain of custody document is used to reconstruct what has happened to an asset during an incident. This can be particularly useful if law enforcement gets involved and you need to prove that a host was under constant control and evidence was not tampered with.

It's rare to have to create documents that will stand up in court. Most incidents end without going to court; many are only concerned with containing a compromise and not actually pursuing the attacker. However, if you find yourself in a situation where you need to create documents that will be used in a lawsuit or criminal investigation, you should consult Section 11.5 at the end of this chapter for more detailed direction.

A chain of custody document is filled out when acting on or transferring an asset during an incident. For instance, if you pull a drive from a compromised host and store it in a safe, you should document that activity on the chain of custody. A chain of custody template should include spaces for time, date, type of activity, name of asset acted upon, and two signatures. Capturing two signatures is important as a mechanism to prevent fraud. It's easy to get one person to lie about what has happened to an asset; it's much harder to get two.
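
The fields of such a template can be sketched as a small record type. The Python below is a minimal sketch under stated assumptions: the class name, field names, and sample names are invented for illustration, and an electronic record like this would supplement, not replace, a signed paper form if legal proceedings are a possibility.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CustodyEntry:
    timestamp: datetime  # date and time of the activity
    activity: str        # type of activity, e.g. "drive removed and stored in safe"
    asset: str           # name of the asset acted upon
    signer: str          # first signature
    witness: str         # second signature

    def __post_init__(self):
        # Reject entries that do not carry two signatures from two people.
        if not self.signer.strip() or not self.witness.strip():
            raise ValueError("both signatures are required")
        if self.signer.strip().lower() == self.witness.strip().lower():
            raise ValueError("the two signatures must come from different people")


entry = CustodyEntry(
    timestamp=datetime(2005, 1, 1, 14, 30),
    activity="drive removed and stored in safe",
    asset="web1 disk 0",
    signer="A. Admin",
    witness="B. Manager",
)
print(entry)
```

The record is frozen so that an entry cannot be altered after it is created; corrections would be made by adding a new entry rather than editing an old one.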

11.1.1.4 Building your bag of tricks

To this point, our discussion of incident response has been a paper exercise. However, technical preparation is just as important as the process and procedures we have talked about so far. You should prepare a proverbial "bag of tricks" that you can reach into during an incident. This bag of tricks should include basic systems tools, forensic analysis aids, and some networking support. All of these tools should be placed on a bootable CD-ROM so they can be used independently of the host operating system.

There are a number of bootable CD images available that can provide a good starting point for you. Since this is a BSD book, and you will be dealing with BSD-based systems, we address the BSD-based bootable CD images. However, BSD is not your only option. For those who prefer Linux, the Knoppix bootable CD image available from http://www.knoppix.net is a great live-CD image to start with.

FreeSBIE is based on FreeBSD. FreeSBIE provides tools for easily modifying the ISO image and creating customized CDs. While it's not explicitly a security-based distribution, it's actively developed and well supported. FreeSBIE is available from http://www.freesbie.org/.

Frenzy, another live CD based on FreeBSD, has the distinction of fitting on a mini-CD-ROM. Using a compressed filesystem, Frenzy packs over 600MB of data onto a 200MB CD. The mini-CD form factor is nice as it fits easily in a pocket and can be taken anywhere with you. However, Frenzy does not have the robust customization tools that FreeSBIE has. Frenzy is available from http://frenzy.org.ua/eng/.

11.1.2. Incident Detection

The next step of the incident response lifecycle is the actual detection of the incident. The detection may come from a variety of sources including an IDS sensor, log file analysis, a user report, or simply odd host behavior. Whether it's an automated alert or a manual process, be prepared for incidents to be reported in a variety of ways. Don't expect something to jump up and say "This is a security incident." Some things may start as a simple host malfunction like excessive CPU usage or low memory. However, upon further inspection, you may discover the CPU utilization is due to a Trojan horse running on the host. At that point, the activity has moved from normal system administration to a security incident. Follow your gut feeling and be prepared to deal with incidents in a variety of forms. That said, do not be too quick to declare everything a security problem. Sometimes complicated situations can feel like a security compromise, but ultimately are system-level problems. A good question to ask your coworkers at this point is a simple "Did anyone change anything recently?" The answer to this simple question may be very enlightening.

11.1.3. Incident Assessment

Once you think you have a security issue, you need to determine the scope of the problem. Generally, time will be of the essence, and the sooner you can make the proper assessment of what's happening, the better off you will be. Before you get started, however, make a note of the time you started investigating so you have a trail to look back on.

Assessing the incident can be tricky. While you want to rapidly determine the scope of the problem, you do not want to cause further service interruption or disturb potential forensic data. At this stage, you want to examine log files, network traffic graphs, and IDS logs to help figure out the "blast radius." However, the whole time you are doing this, attempt to minimize change on the systems if possible. Avoid deleting files or writing to files if you can. This will make the after-the-fact forensic analysis easier.

In the course of your investigation, feel free to bring in others who may be able to help. In particular, when doing your incident response preparation, you identified technical and business owners for each system in your network. If they can help troubleshoot or provide background information, give them a call and see if they can help assess the incident.

False Alarms

One of the authors was a security manager for a large e-commerce company. During one Christmas vacation, he received a frantic phone call from his Network Operations Center indicating that the main web site was under attack and had been down for two hours. As part of the escalation procedure, the manager was notified after the on-call staff could not resolve the problem in the first two hours.

Upon joining the teleconference that was set up to deal with the "attack," the author asked what type of attack it was and where it was coming from. No one knew or seemed to be able to describe what was going on besides high server load and potentially large amounts of traffic. After some troubleshooting and log analysis, the author determined that the referrers in the incoming web requests seemed to indicate that pop-up ads were accounting for a large portion of the web traffic.

A quick phone call to a marketing manager verified that the marketing department had kicked off a new advertising initiative right before the Christmas vacation. The extra traffic caused by the advertising campaign had exceeded the capacity of the web servers. The marketing manager tuned down the number of pop-up ads being served, and the "security incident" ended.
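
The kind of referrer tally described in this story can be approximated with a few lines of Python. This is a sketch under stated assumptions, not a reconstruction of what the author actually ran: it assumes the web server writes the common "combined" access log format (request, status, bytes, referrer, and user agent at the end of each line), and the sample lines, addresses, and host names are made up.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the tail of a combined-format line:
# "request" status bytes "referer" "user-agent"
TAIL = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"\s*$')


def top_referrers(lines, n=5):
    """Count requests per referring host and return the n most common."""
    counts = Counter()
    for line in lines:
        match = TAIL.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        referer = match.group("referer")
        host = urlsplit(referer).netloc if referer != "-" else ""
        counts[host or "(none)"] += 1
    return counts.most_common(n)


# Made-up sample log lines for illustration only.
sample = [
    '192.0.2.1 - - [24/Dec/2004:10:00:00 -0500] "GET / HTTP/1.1" 200 512 "http://ads.example.com/popup?id=1" "Mozilla/4.0"',
    '192.0.2.2 - - [24/Dec/2004:10:00:01 -0500] "GET / HTTP/1.1" 200 512 "http://ads.example.com/popup?id=2" "Mozilla/4.0"',
    '192.0.2.3 - - [24/Dec/2004:10:00:02 -0500] "GET /cart HTTP/1.1" 200 734 "-" "Mozilla/4.0"',
]

print(top_referrers(sample))
```

If one referring host dominates the tally, as the ad server does in the sample, a sudden traffic spike is more likely a marketing campaign than an attack. Run this against a copy of the logs rather than on the affected host itself.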

11.1.4. Response

Now things get sticky. You're sure you have a security problem; now you have to figure out what to do about it. In general, you have two goals: contain the compromise, and restore service as fast as possible.

Containing the compromise may be as simple as pulling the network cable from the compromised host. If you want to analyze the live system, leave the machine powered on so you can run utilities from the CD drive and interact with the console. Note that some attackers may install code that attempts to erase their tracks if the network interface goes down. While this makes for great Hollywood scripts and paranoid security administrators, it is not a common occurrence. Isolating a compromised host is usually better for the business than potentially losing data on the host.

Be sure you check hosts that are similar to the compromised host. For instance, if you are examining a compromise on one web server in a cluster of web servers, it is entirely possible that the attacker used an automatic exploitation mechanism that could easily subvert all the web servers at once.

Restoring service can be tricky. If you have extra hardware and known good backups, you can usually restore the backup to extra hardware and place the machine in service. Be aware, however, that if you deploy a host with the same vulnerability that allowed the attacker to get in the first time, he will likely be able to compromise the new host. You may wish to wait to restore service until you know how the attacker got in. Moreover, if it's taken you a while to identify the security incident, recent backups may contain traces of malicious activity. In this case, it doesn't make sense to restore until you can identify exactly when the attacker got in.

If you don't have extra hardware and want to reuse the existing host, make a backup of the disk before you do anything. For information on copying disks, see "Forensics on BSD," later in this chapter. Then, reinstall the operating system from scratch. It is generally unwise to just reinstall applications on top of a compromised operating system. Unless you are 100% sure that the attacker did not leave any tools or processes in unknown places, you should just nuke the drive.

11.1.5. Postmortem Analysis

Even with your best efforts, you will occasionally suffer a security compromise. It's important to learn from each and every one. Allowing a particular compromise to happen once is excusable; allowing it twice is not. After the events of the incident have finished and the parties have had some time to rest (approximately 24 hours is a reasonable amount of time), you should schedule a postmortem review of the incident.

If you wait too long, the people involved will likely have forgotten important details about the identification or response to the security incident. If you don't wait long enough, people will probably be too tired and want to "just get it over with" as soon as possible.

In your review, you should focus on two things: what went wrong and how to prevent it from happening again. The actions people take away from the meeting should be specific and have owners responsible for their execution.
