Optimizing the Database

The data tier is the part of a distributed application where information is centralized. That means it's one of the most difficult parts of a distributed application to scale out because it maintains state. It's worth squeezing out every extra ounce of performance you can.

Modern database products have come a long way. Microsoft architects concede that, only a few years ago, almost all the leading PC-based relational database management system (RDBMS) products had major limitations with reliability and data integrity. Today these problems have been fixed across the board, and every enterprise-level database provides a sophisticated set of features for optimizing the speed at which queries and updates execute.

Unfortunately, these considerations require an entirely separate book and are to some degree vendor-specific. I heartily encourage you to read a book that's targeted to your RDBMS. If you're using SQL Server, you might want to look at SQL Server 2000 Performance Tuning Technical Reference (Microsoft Press, 2001), which considers every aspect from capacity planning to index tuning. You also can find white papers on MSDN at http://msdn.microsoft.com/library/en-us/optimsql/odp_tunovw_9mxz.asp.

With that in mind, the next sections consider some more general guidelines.

Stored Procedures

You should use stored procedures whenever possible. Not only can stored procedures be precompiled with an optimized execution plan, but they also tend to limit the different types of queries that will be performed on a data source. If you shift from a system that uses 40 possible dynamic queries to one that uses a set of 20 stored procedures, for example, it becomes much easier to optimize the database.
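
For example, here's a minimal sketch of a stored procedure that wraps a single parameterized query; the Orders table and its column names are hypothetical, so substitute your own schema:

-- Wrap a common lookup so the client never builds SQL strings at run time.
CREATE PROCEDURE GetOrdersForCustomer
    @CustomerID int
AS
    SELECT OrderID, OrderDate, OrderTotal
    FROM Orders
    WHERE CustomerID = @CustomerID
    ORDER BY OrderDate DESC
GO

The client then executes one parameterized command that names the procedure, and SQL Server can reuse the procedure's cached execution plan instead of compiling a new dynamic query each time.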

Indexes

Use indexes. Quite simply, indexes are the only way that a database server can retrieve a specific row without performing a time-consuming full scan through every record. Modern databases are also remarkably good at combining existing indexes in various ways to perform complex queries. This process is fairly involved, so you shouldn't try to decide on your own which indexes to create. Instead, use the tools included with the RDBMS. In SQL Server, the Index Tuning Wizard can recommend new indexes based on a "snapshot" of the system workload or by analyzing a set of problematic (slow) queries. Keep in mind that having too many indexes can also hurt performance, because every index takes additional disk space and must be maintained whenever rows are inserted or modified. This is another reason you need a utility such as the Index Tuning Wizard to recommend an optimum set of indexes.
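
For example, if lookups by customer against an Orders table are slow, a nonclustered index on the lookup column is the usual remedy. This is only a sketch; the table and column names are hypothetical:

-- Let SQL Server locate a customer's orders without scanning the whole table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON Orders (CustomerID)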

It's also worth considering a clustered index. A typical, nonclustered index is just a long list of row pointers sorted according to the values in one or more data columns. For example, the index for an ID column is just a long list sorted by ID number. Using the list, SQL Server can quickly locate a specific ID, read the pointer that indicates where to find the full record, and jump to that position. A clustered index, however, determines the physical order of the data on the disk. If you create a clustered index for an ID column, records with lower ID values will actually be stored before records with higher ID values. Because you can physically arrange data only in one way, there can be only a single clustered index on a table. With SQL Server, the primary key index is by default clustered, which is a decidedly mixed blessing.

Clustered data is a great help with range queries. If your data is clustered by date, for example, a query that selects a range of dates from a single month will find all its data close together on the disk, decreasing transfer times. Of course, in an elaborate system with multiple RAID drives, this is not always an advantage. Clustering too much related data close together can lead to disk contention as multiple queries fight to access the same area on the disk. This is known as a hot spot. There is no easy answer to this tradeoff. It's best to work closely with your database administrator and weigh the options.

Of course, life isn't always this painful. Consider a table that consists of Invoice Number, Invoice Date, Invoice Amount, and Sales Office columns. Now suppose that 10,000 records are inserted into this table every day and users most often query data based on Sales Office. This makes Sales Office an ideal column for a clustered index. However, because the new rows will likely originate from a range of different sales offices, inserts will still be spread evenly across the table, reducing the hot spot danger.
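
As a sketch (using the hypothetical Invoices table from this example), the clustered index would be created as shown below. Remember that a table allows only one clustered index, so if the primary key was created as clustered, it would first have to be rebuilt as nonclustered:

-- Physically order the invoice rows by sales office so that queries
-- filtering on SalesOffice read adjacent pages on disk.
CREATE CLUSTERED INDEX IX_Invoices_SalesOffice
    ON Invoices (SalesOffice)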

Profiling

Most modern RDBMS products include profiling tools. With SQL Server, you can use the included Profiler, which provides a slew of useful features:

  • You can monitor the performance of SQL Server and keep track of issues such as failed connections or denied logins.

  • You can identify slow-executing queries, which pose a potential problem for your system.

  • You can debug Transact-SQL statements and stored procedures.

  • You can capture a log of events and replay them later on another system. You can even use this snapshot to determine the optimal set of indexes for your data.

In short, utilities such as Profiler give you some very useful ways to gauge performance and measure the effect of database optimizations. Use them.

Partition Large Data Tables

You can use horizontal or vertical partitioning of large data tables to increase processing speed. Horizontal partitioning segments a large table into multiple tables, each containing a subset of the total rows. One table might contain records for the current year, for instance, whereas another table might contain records with older dates. Alternatively, you might divide a customer table based on the customer's last name.
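
Here's a minimal sketch of horizontal partitioning by date in SQL Server 2000, using member tables with CHECK constraints and a UNION ALL view so that clients can still query one logical table. All table and column names, and the cutoff date, are illustrative:

-- Recent orders in one member table...
CREATE TABLE OrdersCurrent (
    OrderID int NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL CHECK (OrderDate >= '20030101'),
    OrderTotal money NOT NULL
)

-- ...and older orders in another.
CREATE TABLE OrdersArchive (
    OrderID int NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL CHECK (OrderDate < '20030101'),
    OrderTotal money NOT NULL
)
GO

-- A view presents both partitions as a single logical table.
CREATE VIEW AllOrders AS
    SELECT OrderID, OrderDate, OrderTotal FROM OrdersCurrent
    UNION ALL
    SELECT OrderID, OrderDate, OrderTotal FROM OrdersArchive
GO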

Vertical partitioning segments a table with a large number of columns into multiple tables that contain the same number of rows but only some of the columns. These rows are linked by matching a unique row identifier contained in both tables. This approach is most successful if the separate tables are used independently in queries. If one table contains all the information normally required by a query, for example, the query will execute much faster. The other table can store data that is used only occasionally.
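
As a hedged sketch of vertical partitioning (the Products and ProductDetails tables and their columns are hypothetical), the frequently queried columns stay in the main table while rarely used columns move to a companion table that shares the same key:

CREATE TABLE Products (
    ProductID int NOT NULL PRIMARY KEY,
    ProductName nvarchar(50) NOT NULL,
    UnitPrice money NOT NULL
)

CREATE TABLE ProductDetails (
    ProductID int NOT NULL PRIMARY KEY
        REFERENCES Products (ProductID),
    LongDescription ntext NULL,
    Specifications ntext NULL
)
GO

-- Most queries touch only Products; the occasional query joins in the details.
SELECT p.ProductName, d.LongDescription
FROM Products p
    INNER JOIN ProductDetails d ON p.ProductID = d.ProductID
WHERE p.ProductID = 42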

In a large environment, partitioning can extend across several database servers. This presents the application programmer with new challenges. Ideally, you should strive to meet two goals:

  • Distribute the workload evenly.

  • Ensure that every server contains enough information to satisfy an average query.

If you divide order records over two servers, for example, your first goal is to split the traffic so that both servers handle approximately the same number of requests. If you divide a Customers table by last name, you'll probably accomplish this goal. If you divide an Orders table by date, you probably won't, because there will be much more interest in recent orders than in extremely old ones. If you perform a query that retrieves a single customer or order record, the request will be satisfied by a single database server. If you perform a query that retrieves a group of orders by date, you'll also likely need only one server. If you try to retrieve all the orders submitted by a given individual, however, you might need to read order information from more than one server, which will increase overhead and slow down the query.

Once again, these issues are not trivial and require careful planning when you're creating the physical infrastructure of a system. When creating a system that spans multiple database servers, you need to work closely with an experienced database administrator. Extremely large databases are possible, however. Microsoft touts its live example of a 1.5-terabyte database of satellite imagery, which is available at its TerraServer Web site (http://terraserver.homeadvisor.msn.com).

Placing Code in the Data Tier

Most distributed applications use stored procedures to wrap simple SQL statements that correspond to queries, or to record updates, inserts, or deletions. However, it's possible to create a much more intelligent stored procedure that can perform several related tasks (and even enforce business rules).

If the time taken to perform this logic is short (and it usually is), you can achieve a substantial performance improvement by reducing the number of required round-trips. For example, most applications use separate stored procedures to add order records (which link an order to a customer) and individual order item records (which detail each purchased product). That means it takes 10 round-trips to commit an order with nine order items. If you create an intelligent stored procedure, it can perform the work required to insert all these records simultaneously, dramatically reducing overhead.
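
As a rough sketch (with hypothetical Orders and OrderItems tables, a limited number of item parameters, and error handling omitted for brevity), such a procedure might insert the order header and its items inside a single transaction:

CREATE PROCEDURE AddOrder
    @CustomerID int,
    @OrderDate datetime,
    @ProductID1 int, @Quantity1 int,
    @ProductID2 int = NULL, @Quantity2 int = NULL
AS
    BEGIN TRANSACTION

    -- Insert the order header and capture its generated key.
    INSERT INTO Orders (CustomerID, OrderDate)
    VALUES (@CustomerID, @OrderDate)

    DECLARE @OrderID int
    SET @OrderID = SCOPE_IDENTITY()

    -- Insert the related order items in the same round-trip.
    INSERT INTO OrderItems (OrderID, ProductID, Quantity)
    VALUES (@OrderID, @ProductID1, @Quantity1)

    IF @ProductID2 IS NOT NULL
        INSERT INTO OrderItems (OrderID, ProductID, Quantity)
        VALUES (@OrderID, @ProductID2, @Quantity2)

    COMMIT TRANSACTION
GO

Repeating a pair of parameters for every possible item obviously doesn't scale, which is one of the limitations discussed next; a production version would also check @@ERROR after each statement and roll back on failure.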

However, stored procedure programming isn't without a healthy share of headaches. Almost every RDBMS uses a vendor-specific form of SQL that forces you to use primitive script commands to complete your work. Some of the problems you'll encounter include the following:

  • The lack of robust error handling. You must rely on cryptic error codes instead.

  • The need to work at a lower level by manually creating and managing cursors.

  • No direct support for arrays, making it difficult to accept a variable list of items. One common workaround is to encode this information in some sort of XML document (see the sketch after this list).

  • Primitive support for debugging, code reuse, and other object-oriented niceties.
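
Here's a hedged sketch of that XML workaround in SQL Server 2000, using sp_xml_preparedocument and OPENXML to shred a variable-length item list into rows. The OrderItems table and the XML layout are hypothetical:

CREATE PROCEDURE AddOrderItems
    @OrderID int,
    @ItemsXml ntext   -- e.g. '<Items><Item ProductID="3" Quantity="2"/></Items>'
AS
    DECLARE @hDoc int

    -- Parse the XML document and get a handle to it.
    EXEC sp_xml_preparedocument @hDoc OUTPUT, @ItemsXml

    -- Shred the <Item> elements into rows and insert them in one statement.
    INSERT INTO OrderItems (OrderID, ProductID, Quantity)
    SELECT @OrderID, ProductID, Quantity
    FROM OPENXML(@hDoc, '/Items/Item', 1)
    WITH (ProductID int, Quantity int)

    -- Release the memory used by the parsed document.
    EXEC sp_xml_removedocument @hDoc
GO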

In a future version of SQL Server, Microsoft aims to enable developers to code stored procedures using object-oriented .NET code. This has the potential to dramatically change the development model of distributed applications. Until then, you're probably better off placing your data logic in a dedicated .NET data component (if your system is not extremely performance-sensitive).

Normalizing

Normalization is the process by which you break data down into multiple tables. For example, it's customary to split the table of order information from the table that contains the customers who place the orders. This normalized form has several advantages:

  • It reduces duplicate data (which indirectly improves performance).

  • It eliminates the possibility of conflicting data.

  • It greatly simplifies some queries and data analysis tasks. If you were to place customer information in the order table, you would require some extremely awkward SQL code to coax out a distinct list of customers. It would be even harder (and extremely inefficient) to modify customer information because it would exist in multiple order records.

  • It speeds up index creation and modification because normalized tables generally have fewer columns. This can also improve the performance of Update commands.

  • It allows better locking because locking works on a row basis. With normalization, the data is factored into more than one table, and a typical transaction has to lock only some of the rows in one of the tables.

That said, deviating from good normalization practices is sometimes a worthwhile and practical tradeoff. Normalized databases generally require more complex queries: before retrieving data, an application often needs to perform intricate subqueries or multiple joins. These extra steps can reduce performance, both because the required data is fragmented into several different places on the hard drive and because the table joins add extra processing time.
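
To make the tradeoff concrete, here's a hypothetical comparison: the normalized design needs a join to list orders with their customer names, whereas a denormalized table that repeats the customer name in every order row could answer the same question with a single-table read (at the cost of duplicated data and harder updates):

-- Normalized design: customer data lives in its own table, so a join is needed.
SELECT c.CustomerName, o.OrderID, o.OrderDate
FROM Customers c
    INNER JOIN Orders o ON o.CustomerID = c.CustomerID

-- Denormalized alternative: the customer name is repeated in every order row.
SELECT CustomerName, OrderID, OrderDate
FROM DenormalizedOrders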

Selective denormalization can improve performance. However, the associated cost (generally a loss of flexibility, future scalability, data integrity, or even performance for other tasks) is often too high. For these reasons, I rarely recommend denormalization; in fact, I mention it here only because it is a recurrent issue of interest to developers and a valid performance-optimizing tactic. For more information about the risks involved and how you might code around them, I encourage you to read the article "Responsible Denormalization" by Michelle Poolet in SQL Server Magazine, which is provided online by Microsoft at http://msdn.microsoft.com/library/en-us/dnsqlmag2k/html/ResponsibleDenormalization.asp.


