Understanding Subqueries and Joins | Beginning Transact-SQL With SQL Server 2000 and 2005

As the SQL language evolved and was implemented in different products, a few different techniques have been devised for joining tables to match up related records. Although there are a few variations of these techniques, there are essentially three different ways to go about joining records from two different tables. The first two are different forms of join operations within a single SELECT statement. The third technique involves more than one SELECT statement, where one query encompasses a second SELECT statement. This is often called a subquery — a query within a query. The tables in a subquery can be independent of each other or can be related through some kind of matching expression. One technique, used to match rows of the subquery to a row or rows of the main query is often referred to as a correlated subquery. Subqueries are covered in Chapter 10. For now, I will concentrate on the bread and butter of SQL Server queries: the join operation.

Joining Tables in the WHERE Clause

This is still a popular technique in many other database products and is the most traditional method for joining tables. Because it's an older technique, it's often referred to as a legacy join. Although it is supported by SQL Server, this is not part of the recent ANSI-SQL standards and is not the recommended approach for SQL Server. You may encounter this syntax in existing code. In this example, both the Customer and SalesOrderHeader are referenced in the FROM clause, and the join operation is performed in the WHERE clause:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer, SalesOrderHeader WHERE Customer.CustomerID = SalesOrderHeader.CustomerID

The query returns 31,519 rows, comprised of a combination of records from the Customer and SalesOrderDetail tables. Figure 8-2 shows the first 24 of these rows.

image from book Figure 8-2:

This query implements an inner join, which is discussed very soon. The equals sign between each of the column references means that this query returns only rows where there are matching records in each of the tables. This is known as an equijoin, meaning that the values in two tables compared in the join operation must be equal to one another. In the "Outer Joins" section, you'll also see an example of the legacy version for this type of query.

Joining Tables in the FROM Clause

The same operation can also be performed using the ANSI standard method. In the FROM clause, the two tables are referenced with a JOIN statement followed by the ON keyword and the same column references used in the preceding example:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer INNER JOIN SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

The result is the same. If you view the execution plan for both of these queries, you'll note that they both cause SQL Server to perform exactly the same operations. There is no difference in time, cost, or efficiency. You can view the execution plan for a query by enabling this option in the Query menu. Run the query and the Execution Plan tab appears below the results window, as shown in Figure 8-3.

image from book
Figure 8-3:

Reading from right to left, each icon represents an operation. The records are retrieved from both tables and held in memory. Because each of the columns referenced in the JOIN statement are indexed, the query-processing engine chooses to scan and retrieve records using these indexes. Rows in the Customer table are retrieved using a separate, non-clustered index. Records in the SalesOrderDetail table are physically ordered by the CustomerID column based on a clustered index. The width of the arrows indicates the relative volume of data returned from the respective operation. The rows are combined using a hash join method to produce the final result. If you execute either of these two queries, you will see that the execution plans are the same. Float the mouse pointer over the left-most icon to see statistics for the finished product. The Subtree cost shows the total time in seconds for this and all operations that lead to it.

Two major types of joins exist: those that return only corresponding records in two tables or those that return all of the rows in one table and then corresponding rows in the second table.

The AdventureWorks2000 database contains 19,186 customer records. All but 66 have corresponding sales orders rows in the SalesOrderHeader table, where the CustomerID column value is equal to an existing value in the Customer table's CustomerID. There are 31,519 rows in the SalesOrderHeader table.

Inner Joins

This is the most common type of join operation. The purpose of the inner join is to match up rows in one table with corresponding rows in another table where the associated columns contain the same value. If one of these tables' columns has a different value, or no value at all, these rows will not be returned by the query.

Before showing you an example, I'll make this request using common language: I'd like to see all of the customers who have orders — and all of the orders that have corresponding customers. For each customer, show me the customer's account number, and for each order, the order date.

Again, the SQL statement that makes this same request is as follows:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer INNER JOIN SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

You just saw the results from this query in Figure 8-2, so I won't show them to you again. Just remember that it returns 31,519 rows that consist of customers with orders and orders with customers.

The INNER JOIN statement can also be abbreviated by simply using JOIN. Although this is not as explicit, it works just as effectively as the earlier statement:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer JOIN SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

Outer Joins

The job of an outer join is to return all of the rows from one table and then to match those rows in a corresponding table where the joining column has the same value. The difference between this and an inner join is that the unmatched rows in the first table are still returned by the query.

In common language, an outer join request might look like this: I'd like to see all of the customers, and for the customers who have orders, I'd also like to see related order information. Show me the account number for every customer, and if the customer has orders, show me a row for each combination of customers and orders.

Here's the SQL statement for this request:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer LEFT OUTER JOIN SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID

Outer joins always favor one table, the table from which you choose all rows. In this case, the table on the left side of the JOIN statement From Customer LEFT OUTER JOIN SalesOrderHeader is the Customer table. This means that all customer rows will be returned from the query and then the corresponding SalesOrderHeader rows. If you think about it, this makes sense because the SalesOrderDetail table has a foreign key constraint that requires a matching CustomerID value. Given the relationship between these tables, it wouldn't make sense to join them the other way around.

When you execute this query, the results will look much the same as before at a glance. However, notice the row count: 31,585 rows — 66 more than before. What's going on here? Go back and look at the numbers I gave you just before I introduced inner joins. The Customer table contains 31,519 rows including 66 without any orders. The outer join returned the customers who don't have orders. When an outer join doesn't have matching rows in the outer table (in a left outer join, the table on the right is the outer table), null values are returned. To find customers without orders, look for a null in the OrderDate column. But can't you use a query to do this? Rather than making you scroll through 31,000 rows looking for those missing an order date, just alter the query, adding a WHERE clause:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer LEFT OUTER JOIN SalesOrderHeader ON Customer.CustomerID = SalesOrderHeader.CustomerID  WHERE SalesOrderHeader.OrderDate IS NULL

This query returns 66 rows, customers who have no orders, as shown in Figure 8-4.

image from book Figure 8-4:

So, what is the purpose of a RIGHT OUTER JOIN? Using a Right in place of Left in this expression would have the same effect as reversing the order of the tables and columns. In most cases, you could choose to use one or the other. However, if you have an outer join on both sides of a table, you may be constrained to use either a left or a right outer join. Fortunately, visual query design tools can be used to create complex queries. This is easy to do using the Transact SQL Query Builder.

Try It Out

One of a few ways to open the Query Builder is to create a new view in Enterprise Manager.

Drill down into the AdventureWorks database and right-click the Views icon. From the pop-up menu, click New View. . .
Click the right-most button (Add Tables) on the toolbar to add tables to the designer.
From the Add Table dialog window, select the Customer table and click Add.
Select the SalesOrderHeader table (see Figure 8-5) and click Add. Click Close to close the Add Table dialog window.

Figure 8-5:

The designer always assumes you will want an inner join. The diamond on the join line between these tables represents the join, and the logic is quite simple.
Right-click the diamond icon to view a pop-up menu of join operations. From the menu, choose Select All Rows from Customer, as shown in Figure 8-6.

Figure 8-6:

The diagram adds a rectangular "cap" to the left side of the join to indicate that all rows will be returned from the table on the corresponding side of the join.
Now scroll down or resize the Customer table window and check the box next to the AccountNumber column. Also, for the SalesOrderHeader table, check the OrderDate column (see Figure 8-7).

Figure 8-7:

Take a look at the SQL statement in the third pane of the designer. It should look like this:

 SELECT   dbo.Customer.AccountNumber, dbo.SalesOrderHeader.OrderDate  FROM   dbo.Customer LEFT OUTER JOIN dbo.SalesOrderHeader ON dbo.Customer.CustomerID = dbo.SalesOrderHeader.CustomerID

The only real difference between your version of this query and this one is that the designer always adds the dbo user prefix. As far as you're concerned, this doesn't make a difference. This is a great tool for learning join syntax. When in doubt, build your queries this way and examine the SQL. Rather than using the designer as a crutch so you don't have to learn to do it the hard way, use it as a learning tool and then challenge yourself by rewriting the same queries in the Query Editor or Query Analyzer. As I mentioned earlier, there are those purists who refuse to use these design tools to create queries. In my opinion, it all comes down to time, money, and effort. If you can get the job done more effectively using a utility of some kind, then by all means, do so.

There is also shorthand syntax for outer joins. You can abbreviate the join statement by using LEFT JOIN or RIGHT JOIN rather than LEFT OUTER JOIN and RIGHT OUTER JOIN, respectively.

Legacy Outer Joins

Earlier, I showed you how to implement an inner join in the WHERE clause. You'll recall that this technique, often referred to as a legacy join, is not recommended by the ANSI-92 SQL standard. It will, however, work in Transact-SQL and produces the same execution plan. Legacy outer joins are even more discouraged than legacy inner joins. This is because ambiguous results can be returned by a legacy outer join when an expression is placed on the side of the join where all records are to be returned (the "*" side). Microsoft has been threatening for a few years to make the legacy outer join syntax an illegal operation due to this rare anomaly, but as of SQL Server 2005 it has not. I share it with you here because you may encounter it in existing code but you should probably never use this method of creating outer joins. This is the same query as the previous example using the legacy join syntax in the WHERE clause:

 SELECT Customer.AccountNumber, SalesOrderHeader.OrderDate  FROM Customer, SalesOrderHeader WHERE Customer.CustomerID *= SalesOrderHeader.CustomerID

The asterisk (*) indicates the equivalent of a Left or Right part of a join clause, where *= is a Left Outer Join and =* is a Right Outer Join.

Multicolumn Joins

There is no stated limit to the number of columns that can be used in a join. Typically, you would only need to use more than one joining column to support specific business rules (because joins are usually performed on primary and foreign keys). The logic of a join expression is very similar to that of a WHERE clause. Multiple comparisons can be combined using AND and OR operators.

I'll use this technique to find product sales records (in the SalesOrderDetail table) for products sold at dealer cost. This involves matching the ProductID between the two tables and matching the StandardCost from the Product table to the UnitPrice from the SalesOrderDetail table:

 SELECT     Product.ProductID , Product.Name , Product.StandardCost , Product.ListPrice , SalesOrderDetail.UnitPrice FROM       Product  INNER JOIN SalesOrderDetail  ON  Product.ProductID = SalesOrderDetail.ProductID  AND Product.StandardCost = SalesOrderDetail.UnitPrice

14,574 transactions were recorded without a price markup. The first 24 rows of this result are shown in Figure 8-8.

image from book
Figure 8-8:

Non-equijoins

So far, the join operations (in their various forms) you've seen have all used comparisons of equality. In other words, the values compared between two tables must be equal for the query to return matching records. Although far less common, joins can also be performed using any other valid method of comparison. This can include any of those listed in the following table.

Operator	Comparison
<>	not equal
<	less than
>	greater than
<=	less than or equal
>=	greater than or equal

I can modify the previous example to find sales orders for products that were sold below cost:

 SELECT     Product.ProductID , Product.Name , Product.StandardCost , Product.ListPrice , SalesOrderDetail.UnitPrice FROM       Product  INNER JOIN SalesOrderDetail  ON  Product.ProductID = SalesOrderDetail.ProductID  AND Product.StandardCost > SalesOrderDetail.UnitPrice

This returns 28,959 rows of sales orders that cost the business to sell, some of which are shown in Figure 8-9.

image from book
Figure 8-9:

Of course, I can easily turn this query around to show profitable sales by changing the comparison expression to read Product.StandardCost < SalesOrderDetail.UnitPrice. This expression returns 77,838 sales order/product rows.

Special-Purpose Join Operations

I think it's safe to say that you have seen 99% of the join operations you will use day-to-day. Two more types of joins are quite rare: full joins and cross joins. Unless you need to do some very unusual things, you will likely not use them. I can think of just three or four times I've used a full join or cross join in the past few years to solve unique problems.

Full Joins

A full join or full outer join is an outer join that doesn't favor one of the two tables. The result set will return unmatched values on both sides of the join. For example, the Customer table's SalesPersonID column is used to match an employee assigned to the customer. Of the 19,186 customers, 18,485 don't have an assigned salesperson. In a full outer join, you are asking for all customers who have or don't have an assigned salesperson and all employees who are assigned to a customer as well as those who are not. In other words, this query returns all customers, all employees, and combinations of the two based on the joining columns.

Try It Out

Execute the following query and scroll through the results:

 SELECT CustomerID , EmployeeID FROM Customer  FULL OUTER JOIN  Employee ON Customer.SalesPersonID = Employee.EmployeeID ORDER BY CustomerID, EmployeeID

In Figure 8-10, you will notice NULL values in the CustomerID column returned from the Customer table. If you scroll-down, you will also see NULL values in the EmployeeID column returned from the Employee table.

image from book Figure 8-10:

This query returned 19,463 results. This is essentially every possible combination of distinct values for the two columns participating in the join, including NULL values.

The last time I used this type of join was for a medical patient scheduling application. Business requirements called for the user interface to display the available appointment blocks for all doctors in the clinic. Doctors with appointments scheduled for them were to be displayed in a different color. For example, each of three doctors in the clinic could see patients scheduled for appointments that could begin every 15 minutes. An appointment slot table contained scheduling blocks beginning at 9:00 AM and ending at 4:45 PM. Each row in this table represented a 15-minute block (9:00, 9:15, 9:30, and so on). In the scheduling application, my client wanted to see every possible appointment for each doctor, including those that had appointments scheduled. A full join between the appointment slot table and the appointment table did the trick.

Cross Joins

This is the granddaddy of all joins. When using a CROSS JOIN, you don't designate columns for the join to match values. The query will simply return every possible combination of rows for two tables without regard for matching column values. Needless to say, this can produce a large volume of rows and could be an effective way to populate a test database table with sample data. Frankly I have yet to find a practical use for this technique in a production database.

If you were to implement a cross join between the Customer and SalesOrderHeader tables, this is what you should expect to see: The Customer table contains 19,186 records and the SalesOrderHeader table contains 31,519 rows. This means that for every Customer row, 31,519 rows will be added to the result set. If you do the math, the result will contain 604,723,534 possible combinations. Please do not run this query on a production server. If you do, please don't tell your system administrator what book you were reading when you learned to do this. Joking aside, running this query would take several minutes and would consume a fair amount of server resources.

Note

As with any query in the Query Analyzer or SQL Server Management Studio query editor utilities, if it is running too long, you can always click the stop button to abort execution.

Here is the syntax for this query:

 SELECT CustomerID, EmployeeID FROM Customer CROSS JOIN Employee  ORDER BY CustomerID, EmployeeID

In my experience cross joins are more often created by accident than by intent. This is especially true when using the Query Builder. If two tables are added to the diagram that do not have a relationship defined, the Query Builder will generate a cross join query automatically.

Filtering Records in the Join Clause Using a Predicate

This is a lesser-known technique that can be used to filter rows before a join operation is executed. When a traditional multi-table query is executed, the join is processed first and then filtering takes place afterward; at least this is what you're telling SQL Server to do. Fortunately, SQL Server is smart enough to perform filtering ahead of time when it makes sense to do this (as long as indexes and column statistics are up to date, which they should be under normal conditions). Regardless, this technique guarantees that records not matching the filtering criteria will not be considered in the join operation:

 SELECT CustomerID , EmployeeID , CustomerType FROM   Customer  INNER JOIN Employee ON Customer.SalesPersonID = Employee.EmployeeID AND Customer.CustomerID < 10 ORDER BY CustomerID, EmployeeID

However unconventional, this method would have the same effect as if the filter criteria (Customer. CustomerID < 10) were specified in the WHERE clause. For readability, I believe that the WHERE clause is the preferred method.

Joining on an Expression

Chapter 10 discusses the idea of using a SELECT statement as a derived table. For the purpose of the current discussion, it's good to know that joins can be used not only for tables, but also for any SELECT expression. An example so you can see how this works follows.

The director of marketing wants to reduce the cost of selling small-ticket items but doesn't want to discontinue low-priced items that are selling well. The director would like to see the accumulative sales for the 10 least-expensive products. This will require two separate SELECT expressions. The first will return the product rows for the 10 least-expensive products, and the second will return the aggregate sales filtered by this product selection. I'm working from the inside out, so the second expression will actually become the first part of the final query. I'll start with the first step and then add the second expression to the beginning.

The first query simply returns the 10 least-expensive products:

 SELECT TOP 10 ProductID , Name , ListPrice  FROM Product WHERE ListPrice IS NOT NULL ORDER BY ListPrice ASC

The ProductID column is essential but the other two columns shown in Figure 8-11 are included just for reference in the results.

image from book Figure 8-11:

This expression must be given an alias so I can refer to it in another SELECT expression. The second step is to create another query for the aggregated sales orders. The first query is assigned an alias name and then treated as if it were a physical table:

 ELECT  SalesOrderDetail.ProductID , CheapProducts.Name , CheapProducts.ListPrice , SUM(LineTotal) AS SalesTotal FROM SalesOrderDetail INNER JOIN (SELECT TOP 10 ProductID , Name , ListPrice FROM Product WHERE ListPrice IS NOT NULL ORDER BY ListPrice ASC) AS CheapProducts ON SalesOrderDetail.ProductID = CheapProducts.ProductID GROUP BY  SalesOrderDetail.ProductID , CheapProducts.Name , CheapProducts.ListPrice

The alias I created for the derived table is called CheapProducts. The traditional approach is to use a single letter for table aliases. Personally, I find this to be a bit cryptic, so I try to use short and meaningful names. To include columns from the inner products query in the outer result set, they are referenced using the alias, as if this were just another table. Finally, because I'm aggregating the LineTotal column for records sharing the same product, all of the columns except for the aggregate must be included in the GROUP BY column list. This works because each of these column values is unique to a specific product. The results are displayed in Figure 8-12.

image from book
Figure 8-12:

I'll take some more time to cover derived tables and other subquery techniques in the next chapter. This gives you an idea about using joins in creative ways to address unique business rules. What I've learned (and continue to learn) about using SQL to address unique challenges is that there is almost always a method to solve the problem — and there are often several options. The ideal solution is usually not all that complicated. However, the ideal and most elegant solution is often not the one I use on the first attempt.

Multi-Table Joins

Now, let's put it all together. I need a list of all stores and, for those store/customers that have purchased products, details of the order and the product information. In this database, a store is a type of a customer, which requires a join between the Store and Customer tables. Because I want all stores regardless of matching orders, this will require an outer join to the SalesOrderHeader table. All other tables are included through inner joins. Figure 8-13 provides an illustration.

image from book
Figure 8-13:

This query might be particularly tricky if you were to write it without the help of the designer. Because the SalesOrderHeader table participates in two joins, an inner join with SalesOrderDetail and an outer join with the Customer table, this breaks up the SQL syntax. Note the mispairing of JOIN and related ON statements in the SQL statement for this query:

 ELECT     Store.Name AS StoreName , Customer.AccountNumber , SalesOrderHeader.OrderDate , ProductCategory.Name AS Category , ProductSubCategory.Name AS SubCategory , Product.Name AS ProductName FROM       Store INNER JOIN Customer  ON Store.CustomerID = Customer.CustomerID LEFT OUTER JOIN SalesOrderDetail INNER JOIN SalesOrderHeader ON SalesOrderDetail.SalesOrderID = SalesOrderHeader.SalesOrderID INNER JOIN Product ON SalesOrderDetail.ProductID = Product.ProductID INNER JOIN ProductSubCategory ON Product.ProductSubCategoryID = ProductSubCategory.ProductSubCategoryID INNER JOIN ProductCategory ON ProductSubCategory.ProductCategoryID = ProductCategory.ProductCategoryID ON Customer.CustomerID = SalesOrderHeader.CustomerID ORDER BY Store.Name

When you need to write a complex query, it may be a good idea to at least start with the graphical query designer. Figure 8-14 shows the result set for this query from the results pane of the SQL Query Designer. Note the NULL values in the first row indicating that the store/customer has no related order records.

image from book
Figure 8-14:

Union Queries

Joins expand the result set horizontally. That is, columns are added to the results from multiple tables, essentially widening the result. A UNION query expands the results vertically as records are piled on top of one another. A simple example follows. The Employee table and the Individual table both contain records of people. To shorten the list, I'm just going to select the top five rows from each table:

 SELECT TOP 5 FirstName, LastName FROM Employee  SELECT TOP 5 FirstName, LastName FROM Individual

These two queries, even if executed at the same time, return two different result sets, as shown in Figure 8-15.

image from book
Figure 8-15:

Now, I'll put them together. A union combines multiple results with the same number of columns. Columns must have compatible data types. To keep things simple, I suggest that columns have the same names and the same data types. Because this is the case with these two queries, combining them is simple.

Try It Out

Enter and execute the following query:

 SELECT TOP 5 FirstName, LastName FROM Employee  UNION SELECT TOP 5 FirstName, LastName FROM Individual

Note that each SELECT statement could be executed as an independent query. Although not absolutely necessary, all of the columns have the same name. Figure 8-16 shows the results.

image from book
Figure 8-16:

The problem with this result set is that you have no way of knowing what table each of the rows came from. Depending on what you intend to do with this data, this may or may not be important. In case it is, you need to tag each row with a value indicating its source. This is easily accomplished by adding an alias in the individual SELECT statements:

 SELECT TOP 5 FirstName, LastName, 'I' AS Source FROM Individual  UNION SELECT TOP 5 FirstName, LastName, 'E' AS Source FROM Employee

By creating a Source alias, you provide a literal value. In the results shown in Figure 8-17, this indicates whether the person is an individual customer or an employee.

image from book
Figure 8-17:

The UNION operator when used by itself is actually a UNION DISTINCT. This means that the two queries are merged, sorted, and any duplicates removed. Notice the FirstName column in Figure 8-16. The values are in alphabetical order. This is the result of the implied DISTINCT/SORT operation that accompanies the UNION. On the surface this may seem like a good idea, and it may be, but the additional overhead of sorting and removing duplicates can be quite costly when used against large tables. For best performance, use the UNION ALL operator instead. The UNION ALL operator simply concatenates the two results together. Just keep in mind that if the same first name–last name combination existed in both the Employee and Individual table, it would appear twice.

This is kind of a cool trick — I can combine records from multiple tables to produce a unified result set. But what does this really buy me? What is the practical business application of a union? I'll show you a powerful yet simple way to solve what used to be a huge business problem in the following scenario.

Partitioning and Federating Data

In the early days of client/server computing, there were two different approaches to managing data. A database was either managed centrally, in a mainframe or midrange hosted environment; or data was stored in small-scale, file-based data stores. There really wasn't a middle ground between these two options. To compensate for the lack of desktop database scalability, programmers found creative ways to replicate and synchronize data, but none of these work-around solutions provided true concurrency.

Here's a scenario that nearly all of us can relate to: When you make a purchase with a credit card, the store uses a merchant service provider to authorize and transact the purchase. This could be a service offered by their bank, credit card company, or a third-party clearinghouse. Your card number is sent to a data center, a query is executed against a database to retrieve your account information, and then records are added to complete the transaction. Most likely, the bank doesn't have one central data center. A data center is a facility housing a group of computer servers. PC-based servers are often clustered together to form a single, virtual server. Essentially, it's a bunch of PCs lumped together, pretending to be a more powerful computer — almost like a mainframe. The bank probably has a few regional data centers, and merchants connect to one closest to them. For simplicity, assume that my bank has merchant services data centers in Seattle (west coast accounts) and Atlanta (east coast accounts).

I live, and my bank accounts are based, in the Seattle area. If I were to make a purchase at a local store — or a store anywhere in the western United States — the transaction would be processed in the Seattle data center. If someone whose bank account is managed in another data center were to make a purchase in the Seattle area, a connection from the local data center to the customer's home data center would be used to retrieve their account information. Here's how this might work.

If these accounts were managed in SQL Server, a remote server connection may be used to provide on-demand connectivity to a database in another regional data center. In this example, I use the server names EastCoastServer and WestCoastServer to represent the remote servers and LocalHost to represent the respective local server. At each data center, the local database contains a view (a stored SELECT query) called vw_AllAccounts. In the Seattle regional center, the SQL script for the view looks like this:

 SELECT AccountNumber, Balance, CreditLimit, CustomerName  FROM LocalHost.dbo.AccountDatabase.WestCoastAccounts  UNION SELECT AccountNumber, Balance, CreditLimit, CustomerName  FROM EastCoastServer.AccountDatabase.dbo.EastCoastAccounts

In Atlanta, a view with the same name looks like this:

 SELECT AccountNumber, Balance, CreditLimit, CustomerName  FROM LocalHost.dbo.AccountDatabase.EastCoastAccounts  UNION SELECT AccountNumber, Balance, CreditLimit, CustomerName  FROM WestCoastServer.AccountDatabase.dbo.WestCoastAccounts

In each of the regional databases, the local accounts table (either WestCoastAccounts or EastCoastAccounts) is used to manage accounts in that region. The account lookup is performed using a stored procedure that accepts the account number as an input parameter and then looks up a record by matching this value to the AccountNumber column. Because there will only ever be one matching record, the stored procedure's SELECT statement uses a TOP 1 modifier that causes the database to stop searching for records after it locates one. The stored procedure script looks like this:

 CREATE PROCEDURE spGetAccount @AccountNumber Int AS SELECT TOP 1 * FROM vw_AllAccounts

As you can see, this is just a simple SELECT statement with a parameterized value (@AccountNumber) that gets passed in when it is executed. Chapter 13 uses the views and stored procedure to discuss what happens when the credit card transaction is processed.

Performance and Scaling Considerations

Now, let's look at a smaller-scale implementation to improve performance and efficiently manage a large volume of data. In this scenario, sales records are stored in a single table. Perhaps several thousand records are inserted each month. To report on this data over a period of years, it would be necessary to keep perhaps millions of rows in the sales table. Managing such a large volume of data in an active database will inevitably slow performance and raise administrative costs. At first glance this may seem like a perplexing challenge.

Let's view the facts and consider the options to resolve this quandary. To effectively deal with performance issues, you must first identify the bottlenecks. What's typically the slowest part of a computer system? Nearly all system components are solid-state, route electronic signals moving at near the speed of light and rely on transistors switching pulses at millions of times per second. Computers process data at incredible speeds but continue to store it on spinning disks, using mechanical armatures to write and read values. The hard disk is almost always the slowest part of a system. When retrieving data, other components wait while the disk controller finds and moves the read/write head to the right track and then waits for the disk to rotate to the sectors containing the data bits, assembling interlaced values as the disk rotates multiple times. The head is relocated again until all data is retrieved. You can't speed up the disk, but you can spread data across multiple disks to speed up the overall process.

In a partitioned storage solution, data is split up onto multiple disks. This affords SQL Server the opportunity to retrieve data in parallel and make more efficient use of multiple disk drives. There are actually a few different approaches to multiple-disk storage. In this scenario, I'll present an unsophisticated but very effective technique using multiple tables and a Union query.

At the end of every year, a routine removes year-old sales records from the current sales table and inserts them into a specific archive table containing only that year's sales. Each of these tables is placed on a separate physical disk drive. Over a few years, there may be several tables. Suppose this system has been in place for five years, and the disks and tables are set up as shown in Figure 8-18.

image from book
Figure 8-18:

For reporting, a view is created that contains a series of UNION statements:

 SELECT * From CurrentSales UNION ALL SELECT * From 2003Sales UNION ALL SELECT * From 2002Sales UNION ALL SELECT * From 2001Sales UNION ALL SELECT * From 2000Sales

When this query is executed, all of these disk drives may be accessed at the same time. Assuming there are no other performance barriers, this could theoretically improve performance by a factor of five. Not only is this query going to run faster, but it will also ease the burden on competing operations rather than blocking processes and other users' requests. You'll see some more specific examples of this type of query in the discussion on creating database objects in Chapter 13.