Now that you understand how to calculate disk capacity from a SQL Server perspective, it is time to put that knowledge into planning and action. Disks require configuration before they can be presented to Windows and, ultimately, SQL Server. A few tasks must be performed prior to configuring a disk in Windows to allow SQL Server to use it.
How many disks do you need to get the performance you desire and the space you need? Calculate the number of spindles (actual disks) based on the application data you have gathered or should have available to you.
When figuring out the number of spindles needed, remember to take into account how many will make up a LUN. Some hardware vendors allow storage to be expanded only by a certain number of drives at a time, so if you need to add one disk, you might actually need to add six. Also, just because your storage vendor supports a certain number of spindles per LUN does not mean you should actually use that many disks. Each vendor has its own optimal configuration, and you should work with the vendor to determine it.
This is the easiest of the calculations. For this, you need to know the following:
The number of concurrent users
The amount of data each user is returning
The time interval used by the user
Assume you have 250 concurrent users. Each needs to simultaneously bring back 0.5 MB of data in less than one second. This means that in total, you must be able to support constantly bringing back 125 MB of data from your disk subsystem in less than a second, every second, to get the performance you desire.
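The arithmetic above is simple enough to sketch in a few lines. This example uses the numbers just given (250 users, 0.5 MB each, a one-second interval); it is an illustration, not a sizing tool:

```python
# Aggregate throughput requirement from the example above.
concurrent_users = 250
mb_per_user = 0.5          # MB each user brings back
interval_seconds = 1.0     # window in which the data must arrive

required_throughput = concurrent_users * mb_per_user / interval_seconds
print(f"Required disk throughput: {required_throughput:.0f} MB/second")
# Required disk throughput: 125 MB/second
```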
It is time to calculate the throughput of a single physical drive. You must know the following information:
Maximum I/Os supported for the physical drive in its desired configuration (that is, RAID level, sector size, and stripe size are factors).
Ratio of 8k page reads to 64k readaheads for the operations. You might not know this information, but it is useful; if you do not, you can use the numbers shown here as a rough estimate. Each drive is rated at a certain number of I/Os per second, which the hardware manufacturer should be able to provide.
Continuing from step 1, assuming 80 percent 8k page reads and 20 percent 64k readaheads, and knowing that the drive can support 100 I/Os per second:
(80 × 8 KB) + (20 × 64 KB) = 1,920 KB, or roughly 1.9 MB/second throughput on the individual drive
Individual drive throughput might or might not be as important with SANs as it is with DAS. It depends. Because high-end storage subsystems might have lots of cache space, physical performance can be optimized, compensated for, or masked by the architecture of the solution. This does not mean you should buy slower 5,400 rpm disks. It just means you should understand the physical characteristics of your disk subsystem.
Now that you have the amount of data being returned and the throughput of the individual drive, you can calculate the number of drives needed. The formula is:
Number of Spindles = Data Returned/Drive Throughput
Continuing the example started in step 1, 125/1.9 ≈ 66 spindles. Realistically, not many companies buy 66 disks just to achieve optimum performance for their data. This does not even take into account space for indexes, logs, and so on. More important, it does not take into account the type of RAID you will use. This is performance alone, so you can see where trade-offs come into play.
Disks are physical hardware, and understanding their benefits and limitations will help you during the planning process. Each manufacturer's hardware is different, so sometimes it is hard to perform a comparison, but when it comes to implementation, understanding the disk technology you are employing enhances your ability to administer disks and make them available.
Here are some key points to consider:
How fast are the disks that will be fitted in the enclosure? This is measured in rpm, and in most cases, faster is better because it translates directly into faster seek and transfer times. However, many fast-spinning drives generate heat and greater vibration, which translates into planning at a datacenter level (that is, when you buy hardware, make sure it has proper cooling; with lots of systems spinning lots of drives in one physical location, you need sufficient cooling).
What is the rated MTBF for the drives? Knowing the average life span of a drive helps mitigate any risks that might arise from a disk failure. If you put an entire disk solution in place at once, remember that all the drives have the same MTBF, which is affected by their individual usage. This might or might not be a factor for your environment, but it is worth mentioning.
Does the disk solution support hot-swappable drives? If it does not, what is the procedure to fix the drive array with a disk failure? If one disk drive fails, you do not want to affect your availability by having to power down your entire system just to add a new drive in. Taking your service level agreements (SLAs) into account, you can easily fail to meet them if you do not ask your hardware vendor this type of question up front.
When you venture into physical design of a disk subsystem, you must understand how the servers you bought contribute to performance, and how the disk drives themselves work.
Hard disks are platters with a head that reads and writes the data, and then sends it across some protocol such as SCSI or Fibre Channel. No two drives are the same, and some are faster than others. Even drives rated at the same speed might perform differently in different circumstances. One important concept with modern hard disk drives is media banding, which occurs when the drive's head progresses from the outside track closer into the center of the drive, and data delivery rates drop. When you architect for performance, you need to understand exactly where on the physical platters your data could be sitting and how physical placement on the disk affects how fast your disk returns the data to you. Things like media banding are relevant today, but as storage moves toward virtualization, it might not be as important. Storage virtualization is basically the decoupling of the logical representation from the physical implementation. It allows users to present a logical storage infrastructure to operating systems and applications that is independent of how and where the physical storage is located. This will definitely change any planning that you would make, should you employ virtualization.
Your choice of a storage vendor will govern the architecture implemented, ranging from small and fast to large and slower, or anything in between. Individual drive characteristics might not be a choice, as the vendor solution is optimized for their use of disks.
Once you get past the drives, look at the physical architecture of your system. Historically, the farther you move along the signal chain from the processor, the more bandwidth decreases, as shown in Figure 4-2. If you have an 800-MHz processor with 512 MB of PC133 SDRAM, talking over a 300-MHz bus to a SCSI controller that can deliver up to 40 MB per second, and finally to a single drive that can sustain 15 MB per second, you have a pyramid with its smallest point, or tip, at the disk drive. Changing one or more components of that picture, say by ensuring a faster front side bus, might actually result in better performance. For example, does your solution need multiple SCSI controllers to fill your existing bus, or just a fatter, faster 64-bit bus?
Next, break it down to your choice of hardware: what does each component bring to the table, and how can it possibly hinder you from a performance or availability standpoint? Very few people actually take the time to understand their hardware to this level, but in the availability and performance arenas, you must.
SQL Server has three types of disk I/O: random, readahead, and sequential. Random I/O consists of 8k reads of individual pages, whether data or index; an example is a query with a WHERE clause. Readahead I/O issues 64k reads, and this maps to table and index scans. Table and index scans might be advantageous because the I/O cost of retrieving 64k of contiguously placed data is close to that of retrieving 8k once you pay the disk seek time between each 8k read. The final type of I/O, sequential I/O, is what the transaction log uses. Basically, you want the head to never pick up from the disk and just have it write to the log in a continuous stream. If the head has to pick up, go elsewhere, and then return to the next position in the virtual log file, your writes to the log are slower, so reducing or eliminating latency for logs allows log records to be written faster. When SQL Server writes to a disk, it writes at the 8k page level (data or index). These are database engine pages, not memory pages; the 8k size is aligned on a memory-address page boundary, which is a requirement for the high-speed I/O SQL Server employs.
It is extremely important to consider the page size of 8k when designing schemas and indexes. Will all data from one row fit on a single page? Can multiple rows fit on a page? What will happen if you are inserting or updating data and you need to do a page split, either in data, or more to the point, in an index? Avoid page splits if at all possible because a page split causes another I/O operation that is just as expensive as a full 8k write. Keep this in mind with Unicode, as noted earlier, because it has slightly different storage characteristics.
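The rows-per-page question can be made concrete with a rough sketch. The 8,060-byte figure is SQL Server's documented maximum usable row space on an 8k page (the remainder goes to the page header and row offsets); the 400-byte row width is a hypothetical schema:

```python
# Rough estimate of how many rows fit on one 8k data page.
PAGE_USABLE_BYTES = 8060   # 8k page minus header and row-offset overhead

def rows_per_page(row_bytes: int) -> int:
    """How many rows of a given fixed width fit on one data page."""
    if row_bytes > PAGE_USABLE_BYTES:
        raise ValueError("a single row cannot span a page")
    return PAGE_USABLE_BYTES // row_bytes

# A 400-byte row; remember Unicode columns store 2 bytes per character,
# effectively halving the characters you fit per page.
print(rows_per_page(400))   # 20 rows per page
```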
Online transaction processing (OLTP) systems are heavy-duty read/write systems such as e-commerce systems. There are more random reads and writes with OLTP systems, and a safe assumption is at least 100 I/Os per second. The disk controller and cache should be matched to the ratio of reads and writes (see the next section, Understanding Disk Cache ). For an OLTP system, it is potentially better to have a larger number of smaller drives to achieve better performance. One large disk (or LUN) over a smaller amount of disks might not result in the performance you want. To optimize the log, you might want to consider a dedicated controller set to 100 percent write as long as the transaction log is on a dedicated disk.
Do not attempt such a configuration unless you are certain that you are only using the write cache; otherwise you might encounter data corruption.
OLTP systems can quickly become log bound if your application performs many very small inserts, updates, and deletes in rapid succession (and you will see hotspotting on the log files if this is the case), so if you focus on the number of commits per second, ensure that the disk containing the log can handle this.
Consider analyzing your percentage of reads versus your percentage of writes. If you have more reads than writes, you might want more indexes, stripes with more spindles, and more RAM. If you have more writes than reads, large stripes and RAM will not help as much as more spindles and individual LUNs to allow the spindles to write without contention with other write I/O requests. Write cache can also help, but make sure you are completely covered when using the write cache for RDBMS files in general and, specifically, your log.
Planning for data warehouses is different from planning for OLTP systems. Data warehouses are largely based on sequential reads, so plan for at least 150 I/Os per second per drive due to index or table scans. The controller and cache should be set to heavy read. Put your data and indexes across many drives, and because tempdb will probably be heavily used, consider putting it on a separate LUN. Do not try to be a hero for object placement, because the query performance profile will change over time with use, addition of different data, new functionality, and so on. Achieving maximum performance would require a dedicated performance and maintenance staff, but if that is not possible, design for maximizing I/O bandwidth to all disks and ensure proper indexes. Try to make the top 80 percent of your queries perform at optimal levels, and realize that 20 percent will be the outliers. You probably will not be able to have 100 percent of your queries perform optimally; choose the most used and most important, and then rank them on that basis. Log throughput is not as important, as data can be loaded in bulk-logged recovery mode, but it is still important for recoverability.
Taking technology out of the picture for a second, you need to understand the ratio of reads to writes in your individual application to be able to set the disk cache at the hardware level if possible (that is, you cannot do it if you are on a cluster and are using SCSI). If you have a 60 percent read, 40 percent write application, you could configure your cache to that ratio if your hardware supports this type of configuration. However, if you have more than one application or even multiple devices, say, attached to a SAN, optimizing your disk cache is nearly impossible if you have multiple applications sharing a LUN. Many SANs cache at the LUN level, but if they do not, your cache is shared among all LUNs, and you will only be able to do the best you can. This might also force you to have separate disk subsystems for different systems or types of systems to increase performance.
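Deriving that read/write ratio is simple division over observed I/O counts, for example from Performance Monitor's disk read and write counters sampled over a representative period. The counts below are invented for illustration:

```python
# Hypothetical I/O counts sampled over a representative workload window.
reads, writes = 60_000, 40_000

total = reads + writes
read_pct = round(100 * reads / total)
write_pct = 100 - read_pct
print(f"cache hint: {read_pct}% read / {write_pct}% write")
# cache hint: 60% read / 40% write
```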
No discussion about availability and disks would be complete without a discussion of RAID. Over the years, RAID has stood for different things, but from a definition standpoint, it is the method of grouping your disks into one logical unit. This grouping of disks is called various things depending on your hardware vendor: logical unit, LUN, volume, or partition. A RAID type is referred to as a RAID level. Different RAID levels offer varying amounts of performance and availability, and each one has a different minimum number of physical disk drives required to implement it. There are many levels of RAID, and each manufacturer might even have its own variation on some of them, but the most popular forms of RAID are introduced here.
Not all vendors support all the choices or options listed here for RAID, and some might even force you into a single, optimized method for their storage platform. Understand what you are purchasing.
Disk striping, or RAID 0, creates what is known as a striped set of disks. With RAID 0, data is read simultaneously from blocks on each drive. Because you only get one copy of the data with RAID 0, there is no added overhead of additional writes as with many of the more advanced forms of RAID. Read and write performance might be increased because operations are spread out over multiple physical disks. I/O can happen independently and simultaneously. RAID 0 spreads data across multiple drives, which means that the loss of one drive makes the entire stripe unusable. Therefore, for availability, RAID 0 by itself is not a good choice because it leaves you exposed without protection of your data. RAID 0 is bad for data or logs, but might be a good solution for storing online backups made by SQL Server before they are copied to tape. However, this is a potential risk, as you would need to ensure that whatever process copies the backup to another location works, because losing a disk would mean losing your backup.
Disk mirroring, or RAID 1, is simple: a mirrored set of disks is grouped together. The same data is simultaneously written to both drives, and in the event of one drive failure, the other disk can service requests (see Figure 4-3). Some vendors support more than two disks in a mirror set, which can result in something like a triple or quadruple mirror. Read performance is improved with RAID 1, as there are two copies of the data to read from (the first serving disk completes the I/O to the file system), and unlike RAID 0, you have protection in the event of a drive failure. RAID 1 alone does have some high availability uses, which are detailed in the section File Placement and Protection later in this chapter. The disadvantage of mirroring is that it requires twice as many disks for implementation.
RAID 1 is the most recommended form of RAID for SQL Server logs, providing you always maintain enough disk space for your log and its growth on the mirror.
Figure 4-3: Mirroring.
Striped mirrors are a hybrid: disks are first mirrored with RAID 1, and then all mirrors are striped with RAID 0, as shown in Figure 4-4. Striped mirrors offer the best availability of the options presented in this chapter, as they can tolerate a drive failure easily and provide good fault tolerance. You could lose up to half of your disks without a problem, but losing the two disks constituting a mirror pair renders the entire stripe unreadable. Assuming no problems, you also have the ability to read from each mirror. When a failed disk is replaced, the surviving disk in the mirrored pair can often regenerate it, assuming the hardware supports this feature. Striped mirrors require a minimum of four drives to implement and can be very costly because the number of disks needed is doubled (due to mirroring). Striped mirrors are the most common recommendation for configuring disks for SQL Server.
Mirrored stripes are similar to striped mirrors, except the disks are first striped with RAID 0, and then each stripe is mirrored with RAID 1 (see Figure 4-5). This offers the second best availability of the RAID options covered in this chapter. You can tolerate the loss of one drive, but that renders its entire stripe set unusable, leaving you running on the remaining mirror. Your risk of failure goes up significantly once you encounter one drive failure. Mirrored stripes require a minimum of four drives to implement, and this solution is very costly, as you usually need at least two times the number of drives to get the physical space you need.
Striped mirrors and mirrored stripes often confuse people, and manufacturers might call them RAID 10, RAID 0+1, RAID 1+1, RAID 01, and other terms. Work with your manufacturer to understand which one they actually implement for their disk subsystems.
Striping with parity, shown in Figure 4-6, is also known as RAID 5. The parity is not contained on one dedicated drive (that would be RAID 3); it is written across all of the disks and takes up the equivalent of one drive. RAID 5 is probably the most popular RAID level, as it is often the most cost effective in terms of maximizing drives for space.
From an availability standpoint, RAID 5 is not the best: you can only tolerate the loss of one drive. When you lose a drive, your performance suffers, and once you lose more than one drive, your LUN no longer functions. RAID 5 uses the equivalent of one drive for storing parity, which is written across all disks (hence only being able to lose one drive). RAID 5 also has slower write speeds, and so would not be appropriate for, say, a heavily used OLTP system. That said, many fast physical disks might mitigate (or mask) some RAID 5 performance issues, and the caching capabilities of storage subsystems might reduce this further so that it is no longer a point of consideration.
In a situation with unlimited financial and disk resources, striped mirrors give you better performance and reliability. In a limited disk scenario (for example, having only 12 or 16 disks), RAID 5 might be a faster or better implementation than striped mirrors. Of course, that would need to be evaluated.
A physical disk controller (hardware, not software) typically has two channels. Striped mirrors and mirrored stripes require both to mirror. RAID 5 could use two truly separate LUNs to separate the log from the data, and you could access each individually, potentially increasing performance. If you have more than one disk controller and implement striped mirrors or mirrored stripes, you could get the same effect.
Assume you have 12 disks. With striped mirrors or mirrored stripes, you would only have six disks worth of space available because your storage is effectively halved. Even if the hardware can support simultaneous reads (that is, reading from both or all disks in the mirrored pair if you have more than two mirrors), you cannot have simultaneous writes from the application perspective because at a physical level, the hardware is performing simultaneous writes to maintain the mirror. Therefore, in some cases, RAID 5 with a smaller number of disks might perform better for some types of work.
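The 12-disk space comparison works out as follows (the 72-GB per-drive capacity is a hypothetical figure; the formulas follow the RAID definitions in this section):

```python
# Usable capacity under striped mirrors versus RAID 5.
disks, disk_gb = 12, 72

raid10_usable = disks // 2 * disk_gb      # mirroring halves capacity
raid5_usable = (disks - 1) * disk_gb      # parity costs one drive's worth

print(raid10_usable)  # 432 GB usable from striped mirrors
print(raid5_usable)   # 792 GB usable from RAID 5
```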
Parity is a simple concept to grasp. Each physical disk has data (in megabytes or gigabytes), which breaks down into smaller segments, ultimately represented by zeroes and ones. Picture a stripe of, say, 8 bits, and you get the following:
1 0 1 0 0 0 1 1 (also known as A3 in hexadecimal)
Parity is simply the even or odd bit. For even parity, the parity bit would be 0 because if you sum the preceding numbers, you get four, and if you divide by two, there is no remainder.
Now that you understand the basics of parity, what does it buy you? Assume you lose the third disk, which would now make the example look like this: 1 0 x 0 0 0 1 1. If you then sum these numbers and divide by two, you now have odd parity. Therefore, you know you are missing a drive that must have contained a 1 to make the parity check.
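The even-parity example above can be sketched in a few lines of Python:

```python
# The stripe from the example: 1 0 1 0 0 0 1 1 (A3 in hexadecimal).
stripe = [1, 0, 1, 0, 0, 0, 1, 1]

parity = sum(stripe) % 2          # even parity: the bits sum to 4
print(parity)                     # 0

# Lose the third disk (index 2) and reconstruct its bit from the rest:
survivors = stripe[:2] + stripe[3:]
missing = (parity - sum(survivors)) % 2
print(missing)                    # 1, the bit the failed disk held
```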
Beyond the different types of RAID, there are two implementation forms: hardware-based and software-based. Hardware-based RAID is obviously implemented at a physical level, whereas software-based RAID is done after you start using the operating system. It is always optimal to have RAID done at a physical level.
Software RAID is not supported for a clustered system; you must use hardware RAID. Software RAID is an option with a third-party product such as Veritas Volume Manager, but at that point, the third-party vendor would become the primary support contact for any disk problems.
For high availability, nothing can really beat a solution that allows you to geographically separate disks and maintain, say, two SQL Servers easily. This is more often than not a hardware-assisted solution. Remote mirroring is really the replication of storage data to a remote location. From a SQL Server perspective, the disk replication should be transactional based (as SQL Server is). If replication is not based on a SQL Server transaction, the third-party hardware or software should have a process (not unlike a two-phase commit) that ensures that the disk block has been received and applied, and if not, you are still maintaining consistency for SQL Server. If you are considering remote mirroring, ask yourself these questions:
Which disk volumes do you want to replicate or mirror?
What are the plans for placing the remote storage online?
What are the plans for migrating operations to and from the backup site?
What medium are you going to use for your disk replication? Distance will dictate this, and might include choices such as IP, Fibre Channel over specific configurations, and Enterprise Systems Connection.
What is the potential risk for such a complex hardware implementation?
Torn pages are a concern with remote disk mirroring. SQL Server stores data in 8k pages, and a disk typically guarantees writes of whole 512-byte sectors. A SQL Server data page is 16 of these sectors, and a disk cannot guarantee that each of the 16 sectors will always be written (for example, in the case of a power failure). If an error happens during a page write, resulting in a partially written page, that is known as a torn page. Work with your storage vendor to learn how they have planned to avoid or handle torn pages with your SQL Server implementation.
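The sector arithmetic behind torn pages is simple to verify:

```python
# An 8k SQL Server page spans 16 of the 512-byte sectors that a disk
# guarantees individually; an interrupted write of any subset tears it.
PAGE_BYTES = 8 * 1024
SECTOR_BYTES = 512

sectors_per_page = PAGE_BYTES // SECTOR_BYTES
print(sectors_per_page)   # 16
```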
Each storage subsystem has a certain physical design with a high but limited number of disk slots. These are divided across multiple internal SCSI buses and driving hardware (processors, caches, and so forth), often in a redundant setup. If we now construct, for instance, a RAID 5 set in which the same internal SCSI bus drives two of the spindles, that is a bad choice for two reasons:
A failure on that internal bus would fail both disks and therefore fail the entire RAID 5 set on a single event. The same is true for two halves of a mirror, and it also applies to other redundant constructs of a storage subsystem.
Because RAID 5 uses striping, driving all spindles concurrently, two spindles on the same internal bus would need to wait on each other because only one device at a time on a single bus can be commanded. There is little waiting, as internally this happens asynchronously, but it still represents two consecutive operations that could have been one across all spindles at the same time, as is the intention of striping.
For these reasons, among others, some vendors do not offer these choices and abstract that to a pure choice in volume sizes, doing everything else automatically according to best available method given a particular storage solution design.
Windows supports three kinds of disks: basic, dynamic, and mount points.
A basic disk is the simplest form of disk: it is a physical disk (whether a LUN on a SAN or a single disk physically attached to the system) that can be formatted for use by the operating system. A basic disk is given a drive letter, of which there is a limitation of 26.
A dynamic disk provides more functionality than a basic disk. It allows you to have one volume that spans multiple physical disks. Dynamic disks are supported by both Windows 2000 and Windows Server 2003.
A mount point is a drive that can be mounted to get around the basic 26-drive letter limitation. For example, you could create a drive that is designated as F:\SQLData, but maps to a local drive or a drive on a disk array attached to the server.
Only basic disks are supported for a server cluster using the 32-bit versions of Windows and SQL Server 2000 failover clustering. Basic disks are also supported under all 64-bit versions of Windows. Mount points are supported by all versions of Windows Server 2003 for a base server cluster, but only the 64-bit version of SQL Server supports mount points for use with failover clustering. Dynamic disks are not supported natively by the operating system for a server cluster under any version of Windows, and a third-party tool such as Veritas Volume Manager can be used. However, at that point, the third-party vendor would become the primary support contact for any disk problems.
Windows supports the file allocation table (FAT) file system, FAT32, and the NTFS file system. Each version of Windows can potentially update the version of a particular file system, so it is important to ensure that an upgrade would not have any adverse effects. For performance and security reasons, you should use NTFS.
Only NTFS is supported for server clusters. FAT and FAT32 cannot be used to format the disks.
Keep in mind that when using Windows 2000 and Windows Server 2003, the 32-bit versions have a maximum limitation of 2 TB per LUN with NTFS. The 64-bit versions remove this limitation, so if you are building large warehouses, you should consider 64-bit versions.
Do not enable compression on any file system that will be used with SQL Server data or log files; this is not supported in either stand-alone or clustered systems. For a clustered SQL Server, an encrypted file system is not currently supported. Windows Server 2003 clusters at the operating system level do support an encrypted file system. For updates, please consult http://support.microsoft.com .
How you format your disks affects the performance of the disks used for SQL Server data. Disk subsystems optimized for a file server or for Exchange are not optimized for SQL Server. As discussed in the earlier section Understanding How SQL Server Interacts with Disks, if your file system is not designed for SQL Server, you might not be able to get maximum performance. Whether using Computer Management or the command-line format utility, set the NTFS block size appropriately (see Figure 4-7); 64K is recommended, and with 64-bit systems, this can even be set higher. If going beyond 64K, try it first on a test system. For SAN-based systems, make sure you work with your hardware vendor to ensure you are using the optimal stripe size settings for the type of RAID you employ.
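As one concrete illustration, formatting a dedicated data volume with a 64K allocation unit from the command line might look like the following. The drive letter and volume label here are hypothetical, and format is destructive, so verify the switches against your version of Windows and your change process before running it:

```
format E: /FS:NTFS /A:64K /V:SQLData
```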
Make sure that you work closely with your preferred hardware vendor to ensure that the disks configured for your database are exactly the way you want them to be configured and that they are not optimized for other types of disk use. It is important to get the configuration documented for verification. In addition, if a problem occurs, you should have a guarantee from the vendor that the disks were configured properly. It can also help Microsoft Product Support Services or a third-party vendor assist you in the troubleshooting of any disk problems.