Just about everyone wants to figure out how to support more users on their Terminal Servers. While this is a noble endeavor, it's crucial to remember the opening paragraphs of this chapter where we discussed the limitations of server hardware.
In order to fit more users on your system, you need to look for bottlenecks. At some point, a bottleneck will always occur on every server. You just want to see if you can find it and if it's easily fixable. If it is, you can then look for the next bottleneck. Eventually you'll encounter a bottleneck that you can't work around, but hopefully by this point you'll have a lot more users on your server than you started with.
Even the largest datacenter-class servers have practical limits to the number of users they can support, especially if they are running 32-bit versions of Windows. Today's biggest 32-bit systems can support up to about 500 simultaneous users, although most servers are configured to support somewhere between 50 and 150 users.
In order to determine whether your servers can support more users, you'll need to break the server down into its technical components and analyze the techniques required to assess and optimize each one. Let's start by looking at the version of Microsoft Windows that runs on your Terminal Server.
Before you begin to troubleshoot and track down the performance issues of your Terminal Servers, you should understand that the version of Windows that you use on your servers can have a significant impact on their performance. Since you're reading this book, it's probably safe to assume that you are using or will shortly be using Windows 2003 for your Terminal Servers. If not, you should seriously consider using Windows Server 2003.
All things being equal, Microsoft Windows Server 2003 will support 25 to 80% more users than Windows 2000 Server (based on studies by Gartner, HP, Unisys, Microsoft, and the experience of the authors of this book). Most people are skeptical until they see it for themselves, but you really can fit a lot more users on a Windows 2003 server than on a Windows 2000 server with identical hardware.
The only footnote to this rule is that Windows 2003 supports more users than Windows 2000 only when the servers are not constrained by a hardware limitation. In other words, if a server is low on memory or underpowered, then Windows 2003 and Windows 2000 will experience equivalent performance. However, if you have a large server running Windows 2000 (let's say four processors and 4GB of memory), upgrading to Windows Server 2003 will enable you to run more users on that server.
The reasons for this are twofold. First, Windows 2003 performs much better than Windows 2000 across the board, not just with regard to Terminal Services. Second (and less significant, though still interesting), you don't need to tweak and tune Windows 2003 as much as Windows 2000. Windows 2003 includes all the Terminal Server-related tweaks that have been "discovered" over the past three years with Windows 2000.
Memory is easily the most important hardware component of a Terminal Server. The amount of memory in a server directly affects the performance of the other hardware components. For example, in addition to preventing you from supporting as many users as you'd like, not having enough memory adds an extra burden to the processor and disk, since the page file is more heavily used.
If you really want to fit more users onto your Terminal Servers, you'll need to investigate the physical memory. There are several steps to take to be successful at this:
You need to understand how memory works in Terminal Server environments.
You need to figure out whether you have enough memory to do what you want to do.
If the system seems to get slower over time, you need to check for memory leaks.
Before you even begin to think about whether a server has enough memory to support the desired number of users, it's not a bad idea to do a quick mental calculation to determine whether the server's memory is anywhere near appropriate.
The amount of memory required per user depends on the applications they use and the operating system of your server. It is typically appropriate to estimate about 128MB for the base system.
For heavy workers who use Outlook, IE, Word, and Excel and who often switch back and forth between them, a good estimation is to allow about 15MB of memory for each user. For light, task-oriented data entry users who only use a single application without the Windows shell, you can count on about 7MB per user. If your Terminal Servers support complex client/server line-of-business applications, then there's no way to guess how much memory you need. It could easily be 20, 30, or even 50MB per user, depending on the application.
Of course these numbers are just starting points, and your mileage may vary. At first glance these numbers may seem far too low, but in reality they're fairly accurate. (Remember, we're talking about physical memory—not the total memory—required per user.) To understand this, let's first look at how Windows 2003 Terminal Servers use memory.
Every user that runs an application on a Terminal Server will use memory for that application just as if it were running on a normal workstation. A quick check of Task Manager shows that Microsoft Word requires about 10MB of memory to run. Each user of Word will need 10MB, so 20 simultaneous users will require 200MB of memory.
Of course, this is on top of the overhead required for each user to run a session, which is about 4MB. Twenty users running Word require a collective 280MB of memory on the Terminal Server.
To this you must add the memory required by the base operating system, which is usually around 128MB on a Terminal Server. If you add all of these together, you'll find that 20 users will require that a server have 408MB of memory.
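This back-of-the-envelope math is easy to capture in a few lines. The sketch below simply encodes the rules of thumb just described (128MB base operating system, roughly 4MB of session overhead per user); the function name and defaults are illustrative, not a formal sizing tool:

```python
def estimate_memory_mb(users, app_mb_per_user, session_overhead_mb=4, base_os_mb=128):
    """Rough physical-memory estimate for a Terminal Server,
    using the chapter's rule-of-thumb figures."""
    return base_os_mb + users * (app_mb_per_user + session_overhead_mb)

# 20 users running Word at ~10MB each, as in the example above
print(estimate_memory_mb(20, 10))  # 408 (MB)
```

Treat this strictly as a starting point; as explained next, real consumption varies with open documents, supporting DLLs, and shared executables.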
Before you run out and start checking the memory usage of your applications, you should know the two reasons that any calculations you make based on these parameters will be totally useless:
Applications require varying amounts of memory. Although task manager showed Microsoft Word to consume only 10MB of memory, memory consumption will vary greatly depending on the open documents. Download and open a 300-page graphics-laden Windows 2003 white paper, and you'll see that Word can consume much more than 10MB. Also watch out for supporting files, such as DLLs, which sometimes consume the largest amount of memory.
Windows treats multiple instances of the same executable in a special way. If 20 users are all using Word at 10MB each, then you would assume that 200MB of memory is being consumed, right? In actuality, Windows is a bit smarter than that. Because all 20 users are using the same copy of winword.exe, the system figures that it doesn't need to physically load the same binary executable image into memory 20 times. Instead, it loads the executable only once and "points" the other sessions to that first instance. This is done transparently. The components controlling each user's session think that they have a full copy of the executable loaded locally in their own memory space, when in fact all they have is a pointer to another memory space. If one session should need to modify the copy of the executable in memory, the server seamlessly (and quickly) makes a unique copy of the executable for that session.
What is particularly tricky here is the fact that if you look at the task manager, each user's session will report the full amount of memory being used. Only in the total memory usage statistics will you see that the numbers don't add up.
Most people use Task Manager to provide information as to whether a server has enough physical memory (Performance Tab | Physical Memory (K) | Available). The problem here is that the number reported by Task Manager is a bit misleading. While it does correctly show the amount of physical memory that's free, it doesn't really show why that memory is free. For example, if Task Manager shows that you're almost out of memory, you might think that you need more, when in fact adding more memory won't help at all. To appreciate why, you need to understand how Windows manages memory allocation.
In the Windows operating system, individual processes request memory from the system in chunks. Each chunk is called a "page" and is 4K in size. Any pages of memory that Windows grants to a process are called its "committed memory." A process's committed memory represents the total amount of memory that the system has given it, and this committed memory can be in physical memory or paged to disk (or some of both).
Windows watches how each process uses its committed memory. The pages of memory that a process uses often are stored in physical memory. These are called a process's "working set." Of all of a process's committed memory, the working set portion is stored in physical memory and the rest is stored in the page file on the hard disk.
When physical memory is plentiful (i.e. when overall system memory utilization is low), Windows doesn't pay much attention to how each process uses the memory it's been granted. In these cases, all of a process's committed bytes are stored in physical memory (meaning that each process's working set is the same size as its committed memory—even if it doesn't actively use all of it.)
However, when overall physical memory utilization gets a bit higher (80% overall by default), Windows starts to get nervous. It begins checking with all the processes to see how much of their working sets they are actively using. If a process isn't using all of its working set, the system will page some of it out to disk, allowing the system to reclaim some physical memory for other processes. It's important to note that this is natural and does not negatively affect performance, since only unused portions of a process's working set are paged to disk.
Remember that the system doesn't start reclaiming unused working set memory until overall memory utilization reaches 80%. In systems with plenty of memory, Windows doesn't bother paging out unused working set memory.
A "side effect" is that when memory is plentiful, looking at how much memory the system is using at any given point in time (by looking at the working set memory) will show a number that's much higher than it needs to be. This can lead you down the path of thinking that your server needs more memory than it actually does.
To summarize, the committed memory for a process is the amount of memory that Windows allocated it, and the working set is the subset of committed memory that's in the physical RAM. If a process doesn't actively use all of its working set, the system might take away some of the working set in order to better use the physical RAM somewhere else. This does not affect overall system performance since the process wasn't using that memory anyway.
In environments that start to run out of memory, the system will get desperate and start to page out portions of the working set that a process is currently using. At that point, operations will begin to slow down for your users, and you won't be able to add any more users. This is the circumstance you need to look for with Performance Monitor.
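To make the committed-memory / working-set relationship concrete, here's a minimal sketch of the trimming behavior described above. The `Process` class and the 80% trigger are deliberate simplifications of what the Windows memory manager actually does, shown only to illustrate the bookkeeping:

```python
class Process:
    def __init__(self, committed_pages, active_pages):
        self.committed = committed_pages    # all pages Windows has granted (4K each)
        self.active = active_pages          # pages the process actually touches
        self.working_set = committed_pages  # everything stays resident while RAM is plentiful

def trim_working_sets(processes, physical_pages, threshold=0.80):
    """When overall utilization crosses the threshold, page out the
    unused portion of each working set; below it, leave everything alone."""
    resident = sum(p.working_set for p in processes)
    if resident / physical_pages > threshold:
        for p in processes:
            p.working_set = p.active  # only actively used pages stay in RAM
    return sum(p.working_set for p in processes)
```

With plenty of RAM the function leaves working sets untouched, so they look inflated, which is exactly the misleading reading described earlier; under memory pressure it trims them down to the actively used pages.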
Using the Performance MMC snap-in to track actual memory usage is difficult. Now that you understand (or have read, anyway) how Windows 2003 processes use memory, let's use a real-world analogy so that you can start to appreciate the complexity of the issues that you'll need to track. Refer to Figure 13.2 as you read through this short analogy.
[Figure 13.2 legend: your office building, offsite document storage, a process or program, a section of memory used by a process, a paper document.]
Imagine you work in an office building processing documents. If there are only a few people in your office, you'll probably have plenty of room to store all your papers. Everything—even papers you haven't touched for the past twenty years—are stored in your office since space is plentiful. Even though you're storing a lot of unused documents, it doesn't matter since there's so much space in your office.
If you get some new coworkers, they will take up some of your office space. To mitigate this, your boss might say, "Sort through your files and put anything you haven't touched in five years in offsite storage." That would free up space to allow more people to work in the office. Doing this probably wouldn't affect your productivity, since you hadn't touched those papers in five years anyway.
As your company continues to add employees, you need to provide space for them. To free up space, your boss might ask you to move documents that are only a year old to offsite storage. This is not a big problem, since you can always request your documents back from offsite storage. (Of course, this takes a few days and you'd have to send something else there in its place to make room for the newly-retrieved documents.)
If your company continues to add employees, you might be forced to store documents that are only six-months old, and then one-month old. Eventually there will be so many people in such a small office that the majority of your documents will be in offsite storage and no one will be very productive. The only solution is to get a bigger office or lay off some employees (or just convince your customers that this is how all companies work).
Supposing you are a business analyst called in to alleviate this company's document problem, how would you measure it? It's obvious that they need more space or fewer people, but what is reasonable? How can you track productivity as compared to the ratio of onsite to offsite document storage? It's difficult to create hard numbers for this. Who can decide what's acceptable and what's not?
This difficulty is why it's hard to use the Performance MMC to track real memory usage on a Terminal Server. It's also why people have traditionally not been good at estimating actual memory requirements.
The best way to use Performance Monitor in this case is to track trends. You'll need to watch several counters at once to see how they relate to each other when your system starts behaving poorly. Here are the counters that you should track:
The Working Set counter is actually a property of a process, not the system memory, so you'll need to select the Process object. You'll see every running process listed, but selecting the "_Total" instance will give you the grand total, in bytes, of all the working sets of all processes on the system. When you check this, remember that this will not include the base memory, so you'll need to add another 128MB or so.
The Pages Input/Sec counter (Memory | Pages Input/Sec) tells you the number of times per second that a process needed to access a piece of memory that was not in its working set, meaning that the system had to retrieve it from the page file. In our analogy, this is comparable to an employee needing a document that was at the offsite storage facility.
The Pages Output/Sec counter (Memory | Pages Output/Sec) is the opposite of Pages Input/Sec. It tells you how many times per second the system decided to trim a process's working set by writing some memory to disk in order to free up physical memory for another process.
The Available Bytes counter (Memory | Available Bytes) tracks the number of bytes of physical memory that are free. Free bytes are ready to be used by another process. When memory is plentiful, these free bytes are used with reckless abandon, which is why you can't track this counter alone to determine memory requirements.
You should also track the number of user sessions you have on the system in order to have a reference point for what was going on.
So what exactly are you looking for? Start adding some users to the system by following the server sizing test process outlined in Chapter 5. One of the first things you'll notice is that the working set will grow at a very high rate. (Remember that the system doesn't bother managing it until physical memory starts running low.) You'll also notice that the Available Bytes counter will initially take a nosedive since the system freely lets processes' working sets stay in memory. At this point you'll see some activity in both the Memory Pages counters as new users log on, but nothing to be concerned about. (Spikes may be in the 200 range, but overall fairly low.)
As you continue to add users, you'll eventually notice that the working set counter starts to drop. This occurs when the physical memory starts to run low and the system has started trimming the unused portions of processes' working sets. The Pages Output/Sec counter will start to spike high (several thousand per second, even) as the system begins writing those pages to disk. As you continue adding users, the Pages Output/Sec spikes will decrease, although the general trend will be that Pages Output/Sec will increase. This is to be expected, and the performance of your system will not suffer because of it. (Don't buy more memory just yet!)
As you continue to add more users, the working set will continue to drop, and both of the memory pages counters will continue to steadily rise. Available Bytes is down around zero at this point.
What does all this mean? How do you know how many users you can add, and whether more memory will help you? If your server doesn't show the trends as described, then you have plenty of memory. The critical counter to watch is Pages Output/Sec. Remember that it remains low for a while and then starts spiking dramatically. The spikes slowly become less and less pronounced until the counter begins rising overall. The point between the spikes dying down and the counter's slow rise is the sweet spot for a Terminal Server. If your counter never starts to rise significantly after it's done spiking, then you have enough memory and your user base is limited by something else. If your server's Pages Output/Sec counter starts to steadily climb after it's done spiking, then you could probably benefit from more memory or other tuning techniques outlined later in this chapter.
Another problem relating to memory that can negatively impact the performance of a Terminal Server is a memory leak. A memory leak takes perfectly good memory away from the system. Most memory leaks are progressive, taking up more and more memory as time goes on. Depending on the leak, this could be as slow as a few KB per hour or as fast as several MB per minute. In all cases, memory leaks are due to a problem with an application or driver, and most can be fixed with a patch (assuming the application vendor is competent enough to create one).
A memory leak is like a cancer, slowly eating away at the system. Left unchecked, it will cause the system to slow down until it becomes completely unresponsive and hangs. Memory leaks can also cause client connection and disconnection problems, excessive CPU utilization, and disk thrashing. Sometimes you can identify the offending application and kill it manually. Other times a system reboot is the only fix.
The good news about memory leaks (if there is any) is that they are very rare these days, occurring most often in "homegrown" or "really crappy" applications. It seems like they used to occur all the time in Windows NT, but not so much anymore. If you ever run into a consultant who's quick to suggest that all Terminal Server performance problems are due to memory leaks, you will know that this is an "old school" person who hasn't updated his troubleshooting skills in five years.
Identifying a memory leak is usually pretty easy. (Figuring out what's causing it is the hard part.) To do this, you'll also need to use the Performance MMC snap-in. In technical terms, a memory leak occurs when the system allocates more memory to a process than the process gives back to the system. Any type of process can cause a memory leak. You can see that you're having a memory leak by monitoring the Paged Pool Bytes (Memory | Pool Paged Bytes) and Page File Usage (Paging File | %Usage | _Total) counters. If you see either (or both) of these counters steadily increasing even when you're not adding more users to the system, then you probably have a memory leak. You might also have a memory leak if you see a more intermittent increase (still without adding new users), since memory leaks can occur in processes that aren't always running.
As noted previously, identifying that you have a memory leak is easy. Figuring out which process is causing it is harder. You can use Performance Monitor to track the amount of memory that each process is using (Process | Private Bytes | pick a process). Of course on Terminal Servers, you'll have hundreds of processes running, so it's not like you'll want to track each one. Unfortunately, there isn't an easier way to find it. (This is why IT departments need interns.) You can chart all processes at once by selecting the "all instances" radio button on the "Add Counters" dialog box. Sometimes this works well, especially on idle servers. The hundreds of lines paint horizontal stripes across the chart, and any increase is immediately visible. When you see it, enable highlighting (Ctrl+H) and scroll through your list of processes until you find the one that's steadily increasing.
What should you do if you weren't able to isolate the memory leak with Performance Monitor? In this case the memory leak was most likely caused by something operating in kernel mode. Kernel mode memory leaks usually require the assistance of Microsoft Product Support to identify. They'll have you run a utility (poolmon.exe, located on the Windows CD in the "support" folder) that monitors the kernel mode memory pool and outputs contents to a command window.
If you do manage to figure out the cause, there's nothing you can really do about it other than to contact the vendor for a fix or to discontinue using whatever's causing it.
The Windows page file is an interesting creature. A common misconception is that it's "merely" an extension of physical memory used on servers that don't have enough memory. Most people think that if they buy enough physical memory, they'll never have to worry about the page file. In Terminal Server environments, nothing could be further from the truth. While it's true that the page file is used more when physical memory is scarce, Windows also uses the page file in other ways.
Remember from the previous section that Windows is smart enough to only load a single copy of a binary executable into memory when multiple processes (or users) utilize an application. That's technically called "copy-on-write" optimization, since Windows will make an additional copy of a portion of the application in memory only when a process attempts to write to it.
In Windows environments, every executable and DLL is written to as it's used. (This doesn't mean that the EXE or DLL files on the disk are written to. It simply means that once they're loaded into memory, the versions in memory change as they are used.)
Therefore, a single DLL is loaded into memory and the system lets multiple processes share it. However, as soon as a process tries to write to a portion of that DLL, the system makes a quick copy of it (via the "copy-on-write" functionality) and lets the process write to the copy instead. Additionally, the system backs up that section of the DLL to the page file for safekeeping. This means that there are effectively three copies of that portion of the DLL in memory—the original, the copy for the other process to write to, and the backup in the page file. This same phenomenon occurs for every program that is shared by multiple processes (or users), including EXE and DLL files.
In regular Windows environments, backing up the copy-on-written section of an executable to the page file is no big deal. However, imagine how inefficient this is in Terminal Server environments!
Think about a Terminal Server hosting 30 users who are all using an application such as JD Edwards One World. This application is a standard client / server application, and launching the JD Edwards client software loads an executable and several DLLs into the memory space of each user's session. However, Windows only initially loads a single copy of the executables into physical memory.
As all 30 users utilize the application, Windows' copy-on-write optimization will create 29 "copies" in memory of large portions of each JD Edwards executable. (The first user shares the original; each of the other 29 users gets a private copy.) Windows will also have placed an additional 29 copies of the original executable in the page file before those copies were made. The 30 JDE users will effectively cause 59 copies of the single executable to be loaded in memory. Now, imagine this multiplied by each of the many EXEs and DLLs that JD Edwards loads.
Unfortunately, this is all too common. These fat client/server applications were never designed for Terminal Server environments. Applications like JDE OneWorld, Cerner, Lotus Notes, Siebel, PeopleSoft, SAP, and others all load massive client environments when they're launched. (This usually includes the core EXE plus several DLLs.)
Understanding this behavior starts to give you an idea of just how important the page file is in Terminal Server environments, regardless of how much physical memory you have. To help mitigate this, there are really only two things that you can do for a page file:
You can change the way Windows uses the page file.
You can make the page file faster.
Windows' copy-on-write "optimization" is part of the core Windows memory management components, and you can't just turn it off. "Unfortunately," as Kevin Goodman puts it, "it's not like there's a 'NoCopyOnWrite' registry flag that you can use to disable it."
However, you can use third party software products to change the way that Windows implements this copy-on-write functionality. This can be done with RTO Software's TScale product, which is also sold by Wyse under the Expedian brand.
TScale watches how applications use their working sets and how multiple instances of an application are affected by the Windows copy-on-write optimizations. It logs potential optimizations to an optimization map file on the server's hard drive. Then, the next time a user launches the application, the server reads the optimization map and applies those optimizations.
This optimization allows multiple instances of an application to share the backup copies in the page file. This dramatically cuts down on page file usage, which in turn frees up the processor to support more users. TScale also decreases the working set of each instance of the application, freeing up memory that can allow you to support more users. Each application on a Terminal Server is analyzed separately, and each has its own optimization map.
TScale really shines with the big client/server applications. In fact (and quite ironically), the only applications that TScale doesn't greatly affect (maybe 10% more users instead of 30% more) are applications from Microsoft, such as Office, Visio, and Project. (It's almost as if the folks writing these applications in Redmond know something about the way Windows works that no one else does.)
RTO Software offers a 30-day evaluation copy that you can download from www.rtosoft.com. You can see for yourself how much of a difference it would make in your environment.
Even after applying TScale or Expedian page file optimizations, your page file will still be used in a Terminal Server environment. Because of this, you need to ensure that your page file is as accessible as possible.
A heavily-used page file will overly tax the disk I/O. Therefore, refer to the "Disk Usage" section of this chapter for information about how to determine whether your hard disk I/O capacities are causing bottlenecks in your environment. If you determine that your page file is your bottleneck and you'd like to make it faster, there are a few things that you can do:
Put the page file on its own drive on its own SCSI channel.
Buy one of those flash RAM hard drives, such as a TiGiJet (from www.tigicorp.com). These look like regular hard drives except that they are solid state, with flash RAM instead of disks and spindles. They're very fast but also very expensive, costing several thousand dollars for a few gigabytes.
None of these solutions will make a dramatic difference, and you shouldn't even attempt them until after you've implemented a software page file optimization solution like TScale or Expedian.
The last aspect of the page file that has the ability to affect how many users you can fit on your server is the page file size. If your page file runs out of space, then you won't be able to fit any more users on your server. The "official" page file size recommendation for Terminal Server environments is 1.5 times the amount of physical memory. However, this does not need to be strictly followed. When determining your page file size, look at the types and numbers of applications that users will be using. Also consider the amount of total system memory. If you have a server with 512MB, then 1.5x page file is adequate. If you have 8GB of memory, you can probably get away with a smaller page file. Try a 4GB page file first and then increase from there if necessary.
You can check the percentage of page file usage via the Paging File | %Usage | _Total Performance counter.
Once you figure out the size that you want your page file to be, go ahead and configure your server so that the page file starts out at full size. To conserve disk space, Windows allows you to specify a minimum and maximum page file size. The system starts with the minimum and then grows from there. Unfortunately, this means that your system would need to spend resources extending the page file right when the resources are needed most. (After all, that's why the page file is being extended anyway.) Configuring the page file to start out at the maximum size (by entering the same values for the maximum and minimum sizes) will let you avoid this situation. Besides, disk space is usually plentiful on Terminal Servers.
Fortunately, understanding the processor usage of a Terminal Server is much easier than understanding memory usage. There are a few simple steps that you can take to evaluate the processor and address any issues you might find.
Understand how Terminal Servers make use of processors and how you can track their usage.
Take steps to minimize the impact that applications have on the processor.
If your server is running on Intel Xeon processors, understand how enabling or disabling Hyperthreading will affect performance.
Tracking processor utilization is easy with the Performance MMC. Add the following two counters to your chart:
The % Processor Time counter (Processor | % Processor Time | _Total) shows how busy the processors are. If it pegs at 100%, then you need more of something. However, if the processor is too busy, don't automatically assume that you need more processing power. The processor might be busy because you're running out of memory and it is spending unnecessary time writing to and reading from the page file.
If you notice that the processor utilization is fairly high, you might want to track the Processor Queue Length counter as well. This counter shows how many requests are backed up while they wait for the processor to get freed up to service them. By tracking this, you can see if the processor is very busy or too busy. (Yes, there is a difference.) A processor that is very busy might show 100% utilization, but it will back down as soon as another request comes through. You can see this because the Processor Queue Length will be almost zero. A processor that is too busy might also show 100% utilization, except that because it's too busy it cannot service additional requests, resulting in the Processor Queue Length beginning to fill up.
You should always use the Performance MMC instead of the Task Manager to get a good understanding of the processor utilization of your system. Task Manager is only meant to be used as a general estimation of the system and can be off by 5% or more at times.
If you determine that the processing power of your server is limiting the number of users you can host, there are several things that you can do. Your overall approach will be to identify unneeded activities that are using processor resources and eliminate them.
Many applications have "features" that are enabled by default and wreak havoc on the performance of Terminal Servers. For example, disabling Microsoft Word 2000's background grammar checking will allow you to double the number of users on a server. Even though grammar checking might not be too taxing on a single system, one hundred users with constant background checking can severely impact a Terminal Server.
The good news is that you now know what can cause unneeded processor utilization. The bad news is that these issues are nearly impossible to detect with Performance Monitor. (With 120 users on a server, would you really know that disabling background grammar checking could take each user's average utilization from 0.82% to 0.46%?)
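The per-user numbers in that parenthetical look negligible, but a quick bit of arithmetic shows why they matter in aggregate:

```python
# Back-of-the-envelope math for the grammar-checking example: a drop
# from 0.82% to 0.46% average CPU per session is invisible for one
# user, but multiplied across 120 sessions it frees real capacity.
users = 120
before, after = 0.82, 0.46           # average % CPU per user session
total_saved = users * (before - after)
print(f"Aggregate CPU freed: {total_saved:.1f}% of one processor")
# prints "Aggregate CPU freed: 43.2% of one processor"
```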
Another example of this is that disabling Internet Explorer's personalization settings will dramatically increase its loading speed and allow you to run more users.
The bottom line here is that you'll need to manually weed through each of your applications and disable all the neat "features" that could potentially consume resources. Some people feel that they shouldn't be forced to do this. Whether you do or don't depends on what you want out of your servers. Do you want to fit the most users you can on a server or do you want to have shadows under your mouse cursors? You decide.
If your users can live with 256 colors instead of 24-bit color, you might be able to fit 10% more users on your servers. Remember that each option you disable won't have much impact on its own, but multiplying its effect by several hundred users can easily produce some dramatic results.
How your users actually use their applications also affects processor utilization. If you build a Terminal Server that hosts only Microsoft Word, you'll fit a lot more 20 WPM typists than you will 65 WPM typists.
As you analyze your application usage, don't forget to consider the applications that run in the background on the server. A good example of this is antivirus software. Whether you should run antivirus software on your Terminal Servers is a debate that will continue for some time. However, understand that running antivirus software that offers "live" file system protection will severely limit the number of users you can fit on a server. Again, this is not something that you'll be able to track with the Performance MMC. If you can devise an alternate antivirus plan (such as antivirus protection at the perimeter and file server level), you may be able to add 20 to 50% more users to your Terminal Servers.
If your Terminal Server systems are running Intel Xeon processors, then you have the added option of enabling Hyperthreading (via the BIOS). Hyperthreading is an Intel-proprietary technology that makes one processor look like two to the operating system (and two processors look like four, etc.). Xeon processors have two data pipelines going in and out of the core processing unit on the chip. The processor generally alternates between the two pipelines. The advantage of Hyperthreading is that by having two inputs, the CPU will always have something to execute. If one pipeline is not quite ready, then the CPU can pull code from the other pipeline. This happens billions of times per second.
Hyperthreading does have some general disadvantages. One major disadvantage (which doesn't affect performance) is that many applications that are licensed based on the number of processors are tricked into thinking the system has twice the number of processors it actually has.
Another potential problem with Hyperthreading (which does affect performance) is that in terms of instruction execution, Windows isn't smart enough to know that you have Hyperthreading. It can't tell the difference between a two-processor system with Hyperthreading enabled and a regular four-processor system. This can lead to performance problems since the system might split a complex process's threads across two of the four processors for faster execution. In a Hyperthreaded system, this might mean that those two threads both go to the same physical processor, while the other processor sits idle.
Another potential problem with Hyperthreading is that some drivers are not multithreaded and therefore not able to make use of both virtual processors. For example, a single-threaded network card driver can cause system-wide bottlenecks if all network traffic is processed by a single virtual processor.
Despite all the potential complications of Hyperthreading, studies show that enabling it can increase the overall performance of a Windows 2003 Terminal Server. According to Tim Mangan of TMurgent, enabling Hyperthreading on Windows 2003 Terminal Servers seems to give a 10 to 20% performance boost, meaning that the CPU can support 10 to 20% more users.
The overall effect of enabling Hyperthreading varies depending on the hardware, applications, and server load. Therefore, there is no rock solid rule regarding Hyperthreading. You'll need to try it in your environment to see whether it helps or hurts you.
In terms of performance, the hard drives of a Terminal Server are rarely the bottleneck. However, they have the potential to be and you should certainly check them with due diligence.
Before you investigate your hard drives, be sure to check the memory and page file usage, since not having enough memory can cause excessive paging and your hard drives to work harder than they have to.
In most environments, your Terminal Servers only need to contain the Windows operating system, the page file, and your software application files. User and application data is usually stored on other servers or a storage area network (SAN), so the local drives don't slow things down based on user file access.
In order to evaluate whether your hard drives are slowing down your server, check the following Performance counters:
This counter shows you how busy your server's hard drives are. A value of 100% would indicate that the disks are 100% busy, meaning that you might need faster disks, more memory, or fewer users.
As with the processor counters, if your % Disk Time counter is at or near 100, you might also want to monitor this counter. It will tell you how many disk requests are waiting because the disk is too busy. If you have multiple physical disks in your server (that are not mirrored), you should record a separate instance of this counter for each disk instead of one counter for all disks. This will allow you to determine whether your disks are being used evenly, or if one disk is overworked while another sits idle.
If you do determine that your server's hard drives are the bottleneck, there are a few approaches you can take. The first is to investigate your server's disk configuration. You can also opt to replace the current disks with faster ones, or perhaps even change your disk architecture altogether.
Server vendors have all sorts of tricks you can implement to increase the performance of your disks. A now infamous example is that Compaq's RAID cards and disks came with 64MB of cache that was all configured as read-cache. Changing the cache configuration (via a software utility) to 50% read and 50% write cache allowed people to almost double the number of users they could put on a system.
As a quick side note, you'll notice that many of these solutions allowed companies to "almost double" the number of users they can support. Keep in mind that this entire performance analysis is all about finding bottlenecks. In your case, implementing one change might only yield a 5% increase since it would then reveal a new bottleneck. You might have to work your way through several bottlenecks before you see substantial performance gains.
If software configuration alone won't alleviate your disk-related bottleneck, replacing your current disks with faster ones should allow you to (at least partially) release some of the pressure from the disks. Remember that when dealing with servers, you can buy faster disks (15k versus 10k RPM), a faster interface, or both.
Finally, you might ultimately decide that changing the architecture of your disks is the best way to fix your disk problem. If you have a server with two mirrored drives, you might decide that breaking the mirror and placing the operating system on one disk and the paging file on the other is the best way to make use of your hardware.
Limitations of the physical network interfaces on Terminal Servers certainly have the ability to cause a bottleneck. What's interesting about this is that it's usually not the RDP user sessions that clog the network card. Rather, it's the interface between the Terminal Server and the network file servers that usually causes the blockage. These connections are responsible for users' roaming profiles, home drives, load balancing, and all other back-end data that's transferred in and out of a server. Hundreds of users on a single server can easily saturate this link. As with all hardware components, if your network link is your bottleneck, you'll be limited as to how many users you can fit on your server regardless of how much memory or processing power you have.
The easiest way to check your network utilization is via Performance Monitor. Of course you can also use Task Manager (Networking Tab) to do a quick check, but you can't save anything or get any quick details from there. Within the Performance MMC snap-in, load the following counters:
Be sure to select a different instance of this counter for each network card in your server. When you look at the results, keep in mind that a 100 Mb/second network interface is rated in megabits, while the performance counter tracks bytes. Since there are 8 bits in a byte, the performance counter would max out at 12.5 megabytes per second. If you factor in physical network overhead, the actual maximum of a 100Mb network is about ten megabytes per second.
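The unit conversion above is worth doing explicitly, since mixing up bits and bytes is the most common mistake when reading this counter. (The 20% overhead factor below is a rough rule-of-thumb assumption matching the "about ten megabytes" figure in the text, not a measured value.)

```python
# Converting a NIC's rated speed (megabits/sec) to the units that the
# Bytes Total/sec performance counter reports (bytes/sec).
link_mbit = 100                                  # 100 Mb/s Ethernet
theoretical_bytes = link_mbit * 1_000_000 / 8    # bits -> bytes
practical_bytes = theoretical_bytes * 0.8        # rough ~20% overhead assumption

print(theoretical_bytes)  # 12500000.0 -> counter ceiling of 12.5 MB/sec
print(practical_bytes)    # 10000000.0 -> realistic ~10 MB/sec maximum
```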
Just like the other counters, you can determine whether you have a network bottleneck by looking at the output queue lengths of your cards. If you see a sustained value of more than two, then you need to take action if you want to get more users on your server. Of course, there's no input queue length counter for network cards.
Identifying network bottlenecks at your server is easy. Fixing them can be as simple as installing a faster NIC, or implementing NIC teaming or full duplex to increase the capacity of the interface to your server.
However, you might be able to relieve some pressure from your server's network interface by adjusting the overall architecture of your Terminal Server environment. For example, you might choose to build a private network connection between your Terminal Servers and your users' home drives. This would allow users' RDP sessions to flow over one interface while their data would flow over another.
If your environment involves long-haul networks or WAN links, it's possible that a delay in that area could cause adverse performance issues. That issue is not tied specifically to the user capacity of a server though, and is discussed later in this chapter in the "Overall Sluggishness" section.
In some cases, the performance of your system might slow down (or you might hit user limits) when it appears that plenty of hardware resources are available. What should you do when you're experiencing a major performance limit in spite of testing the memory, processor, disks, and network interface, and determining that none of them shows evidence of being the bottleneck?
By running hundreds or even thousands of processes, you're pushing the architectural limits of Windows Server 2003, especially on larger servers. To understand why (and how to fix it), you need to understand how Windows works.
Have you ever wondered what the "32" means in 32-bit Windows? If you thought it has to do with 32-bit processors from Intel, you're half-right. In fact, the 32-bit Windows name derives from the fact that Windows has a 32-bit memory address space. Windows can only address 2^32 bytes (or 4GB) of memory, regardless of the amount of physical RAM installed in a system. Thinking back to your Windows training, you'll remember that this 4GB memory space is split in two, with 2GB for user-mode processes and 2GB for kernel-mode processes.
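The arithmetic behind that "32" is simple enough to check directly:

```python
# A 32-bit address space covers 2^32 bytes, i.e. 4GB, which Windows
# splits evenly between user mode and kernel mode.
address_space = 2 ** 32
gb = 1024 ** 3

print(address_space // gb)       # 4 -> GB of total addressable memory
print(address_space // 2 // gb)  # 2 -> GB each for user mode and kernel mode
```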
Every user-mode process has its own personal 2GB address space. This is called the process's "virtual memory," since it is always available to the process regardless of the amount of physical memory. (A common misconception is that "virtual memory" refers only to memory that's been paged out to disk. That's not what the term means here.)
The kernel and other important system functions all share the same "other half" of the 4GB total memory space. This means that all kernel functions must share the same 2GB memory area.
As you can imagine, this 2GB "kernel-mode" memory space can get quite crowded since it must house all kernel-related information (memory, drivers, data structures, etc.). There is effectively a limit on the amount of data structures and kernel-related information a system can use, regardless of the amount of physical memory.
On particularly busy Terminal Servers, certain components running in this 2GB kernel address space can run out of room, causing your server to slow to a crawl and preventing additional users from logging on. This can happen in environments with plenty of memory, processors, disk space, and network bandwidth.
The good news is that you can tweak the kernel memory usage of some of the kernel's key components, allowing you to fit more users on your server. The bad news is that since all of these components share the same 2GB memory area, giving more memory to one component means that you have to take it away from another.
You're looking for the right balance. Increase one area 5% and you might get ten more users on a server. Increase it 6% and you might get twenty fewer users on the server.
Since adjusting one component affects another, start out by looking at several components together. The kernel memory area is divided into several parts, with the two major parts (called "pools") being a nonpaged pool and a paged pool. The nonpaged pool is a section of memory that cannot, under any circumstances, be paged to disk. The paged pool is a section of memory that can be paged to disk. (Just being stored in the paged pool doesn't necessarily mean that something has been paged to disk. It just means that it has either been paged to disk or it could be paged to disk.)
Sandwiched directly between the nonpaged and paged pools (although technically part of the nonpaged pool) is a section of memory called the "System Page Table Entries," or "System PTEs."
To understand what a PTE is, (and its relevance to Terminal Server sizing), you have to think back to what you just read about how each process can use up to 2GB of memory.
In reality, most processes don't actually use anywhere near their 2GB of memory. However, since each process thinks it has a full 2GB, it references all its memory as if it were the only thing running. Since the system must track the actual memory usage of every single process, it must "translate" each process's memory utilization to physical memory or page file locations. To do this, the system creates a memory page table for each process.
This page table is simply an index that keeps track of the actual locations of a process's memory pages. Each entry in this table tracks a different memory page, and is called a "Page Table Entry." The 2GB of memory that the kernel uses also has a page table. Entries in that table are called "System PTEs."
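The page-table idea can be sketched with a toy model. Everything here (the page numbers, frame numbers, and two-level "ram or pagefile" backing) is a hypothetical simplification for illustration, not how the real Windows structures are laid out:

```python
# A toy model of a per-process page table with 4KB pages: each entry
# maps a virtual page number to where that page actually lives, which
# is what lets every process pretend it owns the full 2GB.
PAGE_SIZE = 4096

page_table = {
    0: ("ram", 731),        # virtual page 0 -> physical frame 731
    1: ("ram", 52),         # virtual page 1 -> physical frame 52
    2: ("pagefile", 9014),  # paged out: a slot within pagefile.sys
}

def translate(virtual_address):
    """Split an address into (page number, offset) and look up the page."""
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    backing, frame = page_table[vpn]
    return backing, frame * PAGE_SIZE + offset

print(translate(4100))  # ('ram', 212996) -> frame 52, offset 4
```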
Why does all this matter when you're troubleshooting the performance of a Terminal Server? Simply put, the maximum number of System PTEs that a server can have is set when the server boots. In heavily-used Terminal Server environments, you can run out of system PTEs. You can use the registry to increase the number of system PTEs, but that encroaches on the paged pool area and increases the risk that you could run out of paged pool memory. Running out of either one is bad, and your goal is to tune your server so that you run out of both at the exact same time. This will indicate that you've tuned the kernel's memory usage as optimally as possible.
In Windows 2003, the system file cache is the part of memory where files that are currently open are stored. Like PTEs and the Paged Pool, the System File Cache needs space in the 2GB kernel memory area. If the Paged Pool starts to run out of space (when it's 80% full by default), the system will automatically take some memory away from the System File Cache and give it to the Paged Pool. This makes the System File Cache smaller. However, the system file cache is critical, and so it will never reach zero (which means yes, the Paged Pool can still run out of space). The system will make a trade-off and try to extend the Paged Pool as much as possible.
Remember that these symptoms occur on systems that are heavily loaded and that do not show any other signs of hardware limitations. If your processor is pegged at 100%, don't you dare try to make any of the changes outlined in this section.
Fortunately, Windows Server 2003 does a really good job of managing kernel memory. It has twice as many System PTEs as Windows 2000, and other memory management enhancements cause Windows 2003 to generally use less of the system paged pool than Windows 2000. All this means that you should be able to get four or five hundred users on a system before having to think about kernel memory tuning.
This does not mean that you can easily get four or five hundred users on a system. It means that most likely, this particular problem will not be seen until you reach somewhere around four or five hundred users. Of course all this is application dependent, and could happen with only two or three hundred users.
Your Windows 2003 server will automatically use the maximum number of PTEs, so long as you:
Boot the server without the /3GB option. (More on this later.)
Do not have the system configured to favor the system cache, so that the memory can be used for System PTEs instead. (Start | Right-click on My Computer | Properties | Advanced tab | Performance Settings button | Advanced tab | Check the "Programs" option in the Memory usage section)
Do not have registry keys set that make session space or system mapped views larger than the default size (48MB).
If you feel that you might be running out of PTEs or paged pool on a Windows Server 2003 Terminal Server, there's an easy test to do. Instead of using the Performance Monitor, you can get a much more accurate snapshot of kernel memory usage with a kernel debugger. Using a kernel debugger probably conjures up nightmares about host machines, serial cables, and all sorts of other things that you thought only developer-type people would ever have to deal with. Fortunately, times have changed.
In Windows Server 2003, using the kernel debugger is easy. It's Windows-based, and you can even use it to debug the computer that it's running on. The kernel debugger can tell you amazing things about the state of the Windows kernel, which is exactly why we're going to use it here.
The first thing you have to do is to download the Debugging Tools for Windows. They're available from: http://www.microsoft.com/whdc/ddk/debugging/.
Go ahead and install them on the Terminal Server that you're testing. Choosing the default options should be sufficient in this case. Then, fire up the Windows Debugger (Start | All Programs | Debugging Tools for Windows | WinDbg).
Next, you'll need to establish a kernel debugging session with the local computer.
Choose File | Kernel Debug
Click the "Local" tab
Now, before you can effectively use the debugger, you need to tell it where your "symbol" files are. Symbol files are files with the .PDB extension that tell the debugger how to interpret all the information that it's getting from the system. Without symbol files, the debugger is useless.
Information about obtaining symbol files is available from the Microsoft Debugging Tools for Windows webpage referenced previously. At the time of this writing, there are two ways to use symbol files with the Windows Debugger.
You can download the symbol files to your hard drive (or copy them from the Windows Server 2003 CD). Then, you configure the Debugger so that it knows where to look for them.
You can use Microsoft's new "Symbol Server," which allows the Debugger to automatically and dynamically download the required symbol files from Microsoft's web site.
Using the symbol server is by far the easiest option and the one you should choose unless your Terminal Server does not have direct Internet access. Configuring your Windows Debugger to use Microsoft's symbol server is easy:
Choose File | Symbol File Path in the Debugger.
Enter the following statement into the box:

SRV*c:\debug*http://msdl.microsoft.com/download/symbols

("c:\debug" can be changed to an appropriate local storage path for your environment.)
Check the "Reload" box.
Now you're ready to begin debugging. Have your users log into the system. Then, when your system begins to slow down, enter the following command into the "lkd>" box at the bottom of the debugger screen:
This will display a snapshot of the kernel's virtual memory (its 2GB memory area) usage, which should look something like this:
lkd> !vm 1

*** Virtual Memory Usage ***
  Physical Memory:    130908 (  523632 Kb)
  Page File: \??\C:\pagefile.sys
    Current:  786432Kb Free Space:  767788Kb
    Minimum:  786432Kb Maximum:    1572864Kb
  Available Pages:     45979 (  183916 Kb)
  ResAvail Pages:      93692 (  374768 Kb)
  Locked IO Pages:        95 (     380 Kb)
  Free System PTEs:   245121 (  980484 Kb)
  Free NP PTEs:        28495 (  113980 Kb)
  Free Special NP:         0 (       0 Kb)
  Modified Pages:        135 (     540 Kb)
  Modified PF Pages:     134 (     536 Kb)
  NonPagedPool Usage:   2309 (    9236 Kb)
  NonPagedPool Max:    32768 (  131072 Kb)
  PagedPool 0 Usage:    4172 (   16688 Kb)
  PagedPool 1 Usage:    1663 (    6652 Kb)
  PagedPool 2 Usage:    1609 (    6436 Kb)
  PagedPool Usage:      7444 (   29776 Kb)
  PagedPool Maximum:   43008 (  172032 Kb)
  Shared Commit:        1150 (    4600 Kb)
  Special Pool:            0 (       0 Kb)
  Shared Process:       2939 (   11756 Kb)
  PagedPool Commit:     7444 (   29776 Kb)
  Driver Commit:        2219 (    8876 Kb)
  Committed pages:     63588 (  254352 Kb)
  Commit limit:       320147 ( 1280588 Kb)
Your output may be a bit different from this sample. (For example, multiprocessor systems will have five paged pools instead of the three shown here.) Out of all this stuff on the screen, only three lines really matter in this case:
Free System PTEs
Paged Pool Usage
Paged Pool Maximum
If your PTEs are really low, then you'll need to try to increase them. If your paged pool usage is almost at the paged pool maximum, then you'll need to increase it. If neither of these is true, then you're not experiencing a kernel memory-related bottleneck.
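If you save the debugger output to a text file, pulling out the three lines that matter is easy to script. This is just a sketch against the sample figures shown earlier, not output from a real server:

```python
# Extract "Free System PTEs", "PagedPool Usage", and "PagedPool Maximum"
# from a saved copy of the "!vm 1" output (sample values from the text).
import re

vm_output = """
Free System PTEs:   245121 (  980484 Kb)
PagedPool Usage:      7444 (   29776 Kb)
PagedPool Maximum:   43008 (  172032 Kb)
"""

def kb(label, text):
    """Return the Kb figure that follows a given counter label."""
    return int(re.search(label + r":\s+\d+\s+\(\s*(\d+) Kb\)", text).group(1))

free_ptes = kb("Free System PTEs", vm_output)
pool_used = kb("PagedPool Usage", vm_output)
pool_max = kb("PagedPool Maximum", vm_output)

print(f"Paged pool {100 * pool_used / pool_max:.0f}% full, "
      f"{free_ptes} Kb worth of free System PTEs")
```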
The default paged pool and system PTE levels are configured via the registry. Let's look at the system PTE entries first:
HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\SystemPages
This registry value allows you to (somewhat) control the number of system PTEs that Windows creates at boot time. A value of zero lets the system create however many it needs to, and a value of FFFFFFFF (Hex) tells the system that you want it to create the absolute maximum number of PTEs that it can. (However, with Terminal Services enabled in application mode, the system will always create as many as it can anyway.) Values anywhere between these two will cause the system to create the specified number of PTEs, although the system does reserve the right to do whatever it wants if the numbers you specify are too extreme.
Each PTE is 4K in size. For every PTE that you can afford to lose, you can add 4K to the size of your paged pool. If you have 10,000 free system PTEs, you can probably afford to lose 7000 (leaving you with 3000). If you have 14,000 free, you can afford to give up 11,000.
Take however many PTEs you can afford to give up and multiply that number by 4. Then, add that number to the size of your PagedPool Maximum as shown in the debugger. (Be sure to add it to the Kb value to the far right.) That number will be the size that you should set your paged pool to be. Open up your registry editor and set your value in the following location:
HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\PagedPoolSize
This registry location stores the value in bytes, not kilobytes. Therefore, multiply your calculated value by 1024 to get the number that you should enter here. When you enter the value, be sure to enter it in decimal format. The registry editor will automatically convert it to Hex format.
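The sizing steps above can be worked through with the numbers from the text. The PagedPool Maximum figure below is the sample value from the debugger output shown earlier:

```python
# Sizing the paged pool: give up 11,000 of 14,000 free System PTEs
# (4K each) and grow the paged pool by the reclaimed amount.
pte_size_kb = 4
free_ptes = 14_000
ptes_to_give_up = 11_000             # leaves 3,000 PTEs free

pagedpool_max_kb = 172_032           # "PagedPool Maximum" Kb value from !vm
new_pool_kb = pagedpool_max_kb + ptes_to_give_up * pte_size_kb
registry_value = new_pool_kb * 1024  # PagedPoolSize is stored in bytes

print(new_pool_kb)     # 216032 -> new paged pool size in Kb
print(registry_value)  # 221216768 -> enter in decimal; regedit converts to hex
```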
A value of zero will allow Windows to automatically choose the optimum size, and a value of FFFFFFFF (Hex) will tell Windows to maximize the size of the paged pool (at the expense of the ability to expand other areas, including system PTEs). A hex value anywhere in between gives Windows an idea of what size you'd like the paged pool to be, although (as with the PTEs) Windows reserves the right to ignore your setting. If you limit the paged pool to 192MB or smaller (0C000000 Hex), Windows will be able to use the extra space for other things (such as system PTEs). Most systems will never let the paged pool get bigger than about 500MB.
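The 192MB figure quoted above checks out exactly against its hex representation:

```python
# 0C000000 hex is exactly 192MB -- the largest PagedPoolSize value that
# still leaves Windows free to expand other kernel areas such as PTEs.
limit = 0x0C000000
print(limit)                 # 201326592 bytes
print(limit // (1024 ** 2))  # 192 MB
```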
There's one more registry value that you should understand here:
HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\PagedPoolMax
The PagedPoolMax value specifies the maximum percentage of the paged pool that you want to be full before the system starts stealing space from the system file cache. The default value of zero will cause the system to reclaim space from the system file cache only when the paged pool is 80% full. If you want the system to wait until the paged pool is 90% full, then set this registry value to 90 (decimal). If you want the system to steal file system cache space when the paged pool is only 40% full, set this value to 40 (decimal).
The first thing you should do when tuning your kernel is to look at the values for these three registry keys. If any of these three values is not set to "0," then you should reset them all to zero, reboot the system, and run your test again.
If you have fewer than 3000 free system PTEs at this point then forget it, there's really nothing more you can do. If you have more than 3000 free system PTEs, you could try maxing out the paged pool size by changing the PagedPoolSize registry value to FFFFFFFF (Hex). Reboot your server and run your test again to see if your changes made a difference for the better.
The risk you run here is that your optimal system PTE and paged pool sizes might be somewhere in the middle of maxing out one or the other. The only way to avoid this is to use a kernel debugger to view the actual size of the paged pool.
At this point you can reboot your server and rerun your test to see if it made a difference. Keep in mind that Windows will override your manual setting if it doesn't like it, so it's possible that your registry editing won't change anything at all.
While we're still on the topic of kernel memory usage, we should take a second to address the various boot.ini file switches that can be used in Terminal Server environments. If you do an Internet search on performance of Terminal Servers, you'll come across different theories about how these switches should be used. Let's debunk the theories and look at how these switches really work.
There are two boot.ini switches you should know about: /3GB and /PAE.
As you recall, 32-bit Windows systems can address 4GB of memory, and that memory is split into two 2GB chunks, one for the kernel and one for each process. Quite simply, adding the "/3GB" switch to a boot.ini entry changes the way the server allocates the 4GB memory space. Instead of two 2GB sections, using the "/3GB" switch changes the partition so that the kernel gets 1GB and each process gets 3GB.
This is useful for memory-hungry applications such as Microsoft Exchange or SQL Server. As you know, however, busy Terminal Servers already strain the kernel's default 2GB allocation, and as you can imagine, limiting the kernel to only 1GB of virtual memory would have disastrous consequences in a Terminal Server environment. Besides, in order to use the full 3GB, an application has to be compiled in a special way, so it's not as if adding the "/3GB" switch would affect "regular" applications anyway.
If your Terminal Server is booting up via a boot.ini entry with the /3GB switch, remove it immediately.
The "Physical Address Extensions" (PAE) boot.ini switch is used when 32-bit Windows Terminal Servers have more than 4GB of physical memory. Since the 32-bit Windows operating system can only address 4GB of virtual memory, systems with more than 4GB have to perform some fancy tricks to be able to use the physical memory above 4GB. These "fancy tricks" are enabled by adding the "/PAE" switch to the entry in the boot.ini file. If you're using a server with more than 4GB of RAM, then be sure that you have the "/PAE" switch in your boot.ini file. (In order to use more than 4GB of physical memory, you'll have to use Windows Server 2003 Enterprise or Datacenter Edition.)
In Windows 2000, busy Terminal Servers would often be limited as to the amount of users they could support when the registry ran out of space. Even though you could adjust the maximum size of the registry, you were still limited by the fact that the entire registry was loaded into the kernel's paged pool.
Fortunately, the architectural changes introduced in Windows Server 2003 (well, technically they were introduced in Windows XP) affect the way that Windows loads the registry into memory. In Windows Server 2003 environments, the registry is not stored in the paged pool, meaning that there is effectively no size limit to the registry. The registry only consumes 4MB of the paged pool space regardless of how large it actually is. (There is also no registry size limit setting in Windows 2003.)