Lesson 3: Optimizing Data Storage



Estimated lesson time: 25 minutes


How you optimize data storage depends on why you are optimizing as well as on your users' requirements. Sometimes you might be optimizing data storage to save disk space (which is typical for data warehouses), and at other times you might want to save memory or optimize query performance.

Optimizing Row Width

One way to optimize a database to save disk space is simply to remove seldom-used indexes that allocate a lot of disk space. In Lesson 2, the section titled "Analyzing Index Usage" covered the details of which indexes to remove. Another important task is to optimize the row width of large indexes and tables. SQL Server stores data on 8-kilobyte pages, and it is important to have as little unused space as possible in each data page. Note that this type of optimization is typically useful only on tables that contain wide rows, that is, rows wider than roughly 400 bytes.

Each data page has a 96-byte header and 8,096 bytes available for rows (data) and row offsets (the pointers stored at the end of each page indicating where each row begins). If a row cannot fit into the free space of a data page, it must be stored in a new data page. This means that, for some tables and indexes, the size of the rows can prevent data pages from being filled. The less each data page is filled, the more disk space is wasted. If the table isn't very big (1 GB, for example), you might not care much about the price of disk space, but consider that everything located on disk is also brought into memory when it is accessed. If page space is poorly used, the table consumes more memory than necessary, and memory is far more expensive than disk space.

Data stored using variable-length data types (VARCHAR, NVARCHAR, and VARBINARY) can be moved to overflow pages if it does not fit into the row's original data page.

The following is an example of one table (Test.BigWaste) with a row width that causes a large amount of unused space in each data page and another table (Test.OnlyALittleWaste) that almost fills each data page completely. Each table is populated with 10,000 rows.

CREATE TABLE Test.BigWaste (
    Txt CHAR(3000) NOT NULL
);

INSERT Test.BigWaste (Txt)
    SELECT TOP(10000) text
    FROM sys.messages;

CREATE TABLE Test.OnlyALittleWaste (
    Txt CHAR(4000) NOT NULL
);

INSERT Test.OnlyALittleWaste (Txt)
    SELECT TOP(10000) text
    FROM sys.messages;

Note that, for the purpose of this example, the empty space caused by a column using a large char data type is not considered to be waste. Now, the sys.dm_db_index_physical_stats dynamic management function is used to retrieve information about how these two tables are stored:

SELECT
    OBJECT_NAME(object_id) AS TableName
    ,page_count
    ,ROUND(avg_page_space_used_in_percent, 2) AS avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Test.BigWaste'), NULL, NULL, 'DETAILED')

UNION ALL

SELECT
    OBJECT_NAME(object_id) AS TableName
    ,page_count
    ,ROUND(avg_page_space_used_in_percent, 2) AS avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Test.OnlyALittleWaste'), NULL, NULL, 'DETAILED')

ORDER BY TableName;

The output of the previous query is shown in the following table.


TableName         page_count  avg_page_space_used_in_percent
----------------  ----------  ------------------------------
BigWaste               5,000                           74.33
OnlyALittleWaste       5,000                           99.04
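These figures match simple page arithmetic: each fixed-length row here occupies its data plus roughly 9 bytes of overhead (about 7 bytes of row header plus a 2-byte entry in the row-offset array), and only whole rows fit on a page:

-- Test.BigWaste:         2 rows x 3,009 bytes = 6,018 of 8,096 bytes = 74.33%
--                        (a third row cannot fit, so roughly 26% of each page is empty)
-- Test.OnlyALittleWaste: 2 rows x 4,009 bytes = 8,018 of 8,096 bytes = 99.04%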

Note that both tables use exactly the same number of pages to store their data, even though each row in the OnlyALittleWaste table is 1,000 bytes wider than each row in the BigWaste table. As you can see, in this example, almost 26 percent of the BigWaste table's size is wasted on empty page space. If this table were 10 GB in size, that 26 percent would amount to 2.6 GB.
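To hunt for this kind of waste across an entire database rather than one table at a time, you can pass NULL instead of a specific object ID to sys.dm_db_index_physical_stats. The following is a minimal sketch; the 80 percent and 100-page thresholds are arbitrary choices for this example:

-- List leaf levels that are, on average, less than 80 percent full.
-- SAMPLED mode keeps the scan reasonably cheap on large databases.
SELECT
    OBJECT_NAME(object_id) AS TableName
    ,index_id
    ,page_count
    ,ROUND(avg_page_space_used_in_percent, 2) AS avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED')
WHERE index_level = 0
    AND page_count > 100
    AND avg_page_space_used_in_percent < 80
ORDER BY page_count DESC;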

De-Normalizing

De-normalizing a database design implies storing data redundantly. The typical benefit is less work for the database engine when reading data; the cost is more work when writing data, because more than one place must be updated. De-normalization can also introduce bugs when the redundant data is not updated accordingly.

De-normalizing a database design should typically be a last resort. That said, de-normalization can increase performance dramatically. In fact, you could say that indexing, and especially indexed views, is de-normalization at the physical level. The great benefit of de-normalizing at the physical level is that the database engine maintains the redundant data for you.
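For example, the following sketch (the view and index names are made up for this illustration) materializes a gross subtotal per sales order in AdventureWorks; once the unique clustered index exists, the database engine keeps the aggregated rows in sync with Sales.SalesOrderDetail automatically:

USE AdventureWorks;
GO

-- An indexed view must be created with schema binding.
CREATE VIEW Sales.vOrderSubtotals
WITH SCHEMABINDING
AS
    SELECT
        SalesOrderID
        -- Gross amount, ignoring discounts.
        ,SUM(UnitPrice * OrderQty) AS Subtotal
        -- COUNT_BIG(*) is required in an indexed view that uses GROUP BY.
        ,COUNT_BIG(*) AS DetailRowCount
    FROM Sales.SalesOrderDetail
    GROUP BY SalesOrderID;
GO

-- Materializing the view: from this point on, the database engine
-- maintains the aggregated rows whenever Sales.SalesOrderDetail changes.
CREATE UNIQUE CLUSTERED INDEX idxVOrderSubtotals
    ON Sales.vOrderSubtotals (SalesOrderID);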

An example of de-normalization is to store the finished result of certain queries in a table and then read the result directly from that table instead of executing the queries again. If this is what you need, you should probably first look at the possibility of caching the result of these queries in the application. Another use of de-normalization is to redundantly store columns from a parent table in a child table, for example:

USE AdventureWorks;

-- Add the column to the child table.
ALTER TABLE Sales.SalesOrderHeader
    ADD CustomerAccountNumber VARCHAR(10) NULL;
GO

-- Update the child table's column with values from the
-- parent table.
UPDATE soh SET
    CustomerAccountNumber = c.AccountNumber
FROM Sales.SalesOrderHeader AS soh
INNER JOIN Sales.Customer AS c
    ON soh.CustomerID = c.CustomerID;

Now the following query can be executed without joining to the parent table:

SELECT
    CustomerAccountNumber
    ,SalesOrderID
    ,OrderDate
FROM Sales.SalesOrderHeader;

To keep the values in the CustomerAccountNumber column up to date, you will need to use an update trigger on the Sales.Customer table, such as the one sketched below, or execute the previous update statement at certain intervals.
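A minimal sketch of such a trigger follows; the trigger name is made up for this example, and the trigger assumes the CustomerAccountNumber column added earlier. Every update to Sales.Customer simply pushes the current account number down to the child rows:

-- Keep the redundant CustomerAccountNumber column in sync when
-- a customer row is updated.
CREATE TRIGGER trgCustomerAccountNumberSync ON Sales.Customer
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Push the (possibly changed) account number down to the
    -- redundant column in the child table.
    UPDATE soh SET
        CustomerAccountNumber = i.AccountNumber
    FROM Sales.SalesOrderHeader AS soh
    INNER JOIN inserted AS i
        ON soh.CustomerID = i.CustomerID;
END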

You should generally use de-normalization only when the redundant data (the CustomerAccountNumber column in the previous example) does not always need to be synchronized with the source data, that is, when it is acceptable to refresh the redundant data periodically, such as once a day or every 30 minutes.

Data Types

You can think of selecting data types as a two-step process. The first step is deciding which data type is appropriate, and the second step is making sure that the data type is used consistently for the specific purpose throughout the organization, or at least throughout the database. The second step is usually the more difficult of the two.

Deciding on the Appropriate Data Type

When you decide on the appropriate data type to implement, you should typically at least consider the following:

  • What needs to be stored in the column?

    The data type must support the data that it needs to store. For example, the nvarchar data type requires more space than the varchar data type, but if you need to store Unicode data, you must use the nvarchar data type.

  • How much storage space does the data type use?

    The less storage space needed, the less disk space and memory will be required by the table in which the data type is used.

  • Will the column be indexed?

    If the column will be indexed, it is important that the data type be as small as possible to allow for effective indexes, as the back-of-the-envelope numbers below suggest.
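To get a feel for those numbers (ignoring per-row overhead): the leaf level of a nonclustered index on a heap stores each key together with an 8-byte row locator, so key width directly determines how many index entries fit on an 8-kilobyte page:

-- 4-byte int key + 8-byte row locator:   ~8,096 / 12 = ~670 entries per page
-- 80-byte char key + 8-byte row locator: ~8,096 / 88 =  ~92 entries per page

The narrower key needs roughly one-seventh as many leaf pages for the same number of rows, which translates directly into less I/O and less memory.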

Using Data Types Consistently

Selecting data types is a time-consuming part of designing and implementing databases, and inconsistent use of data types throughout a database can cause significant problems. To facilitate the consistent use of data types, you should consider the use of alias types (also known as user-defined data types). An alias data type is just what it sounds like: an alias for a system data type. You can create one data type for each kind of data in your tables at a more granular level than merely text or numbers. The following example illustrates the use of alias data types:

CREATE TYPE Test.PHONE_NUMBER FROM VARCHAR(20);
CREATE TYPE Test.PERSON_NAME FROM NVARCHAR(50);
GO

CREATE TABLE Test.Employees (
    Firstname Test.PERSON_NAME NOT NULL
    ,Lastname Test.PERSON_NAME NOT NULL
    ,HomePhoneNumber Test.PHONE_NUMBER NULL
    ,FaxNumber Test.PHONE_NUMBER NULL
);
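The payoff comes when the same kind of data appears in more tables: reusing the alias types guarantees identical lengths (and, for character data, identical Unicode behavior) everywhere. The Test.Customers table below is made up for this illustration:

-- Reusing the same alias types in another table keeps all name and
-- phone number columns identical across the schema.
CREATE TABLE Test.Customers (
    ContactName Test.PERSON_NAME NOT NULL
    ,ContactPhoneNumber Test.PHONE_NUMBER NULL
);

Because both tables now declare these columns from the same types, joins and comparisons between them never risk silent truncation or implicit conversion.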

Lab: De-Normalizing an Aggregation

In this lab, you will de-normalize the Sales.Customer table to include a column with the sum of purchases for each customer. You will then create a trigger to maintain the redundant data stored in this column. This lab is intended to show the expense of automatically updating redundant columns, especially when these columns contain aggregated data. For example, instead of using a trigger, this column could have been updated by a SQL Server Agent job once per day. The completed lab is available in the \Labs\Chapter 07\Lab3 folder on the companion CD.

  1. Open SQL Server Management Studio and connect to an instance of SQL Server 2005.

  2. Open a new query window, and type and execute the following data definition language (DDL) statements to disable the existing triggers on the Sales.SalesOrderDetail and Sales.Customer tables:

     USE AdventureWorks;

     ALTER TABLE Sales.SalesOrderDetail
         DISABLE TRIGGER iduSalesOrderDetail;

     IF (OBJECT_ID('Sales.Customer') IS NOT NULL)
         ALTER TABLE Sales.Customer
             DISABLE TRIGGER uCustomer;

  3. Turn on Include Actual Execution Plan in SQL Server Management Studio by pressing Ctrl+M.

  4. Type and execute the following dummy update statement to test its performance. This update statement is executed to check the query cost of updating an order detail row prior to the implementation of the de-normalization:

     UPDATE Sales.SalesOrderDetail SET
         OrderQty = OrderQty
     WHERE SalesOrderDetailID BETWEEN 1 AND 1000;

    Note the cost of the update from the graphical execution plan.

  5. Type and execute the following DDL statement to add the column that will store the total sales for each customer:

     ALTER TABLE Sales.Customer
         ADD TotalSales MONEY NULL;

  6. Type and execute the following statement to create the trigger that will maintain the TotalSales column in the Customer table. This trigger is created on the Sales.SalesOrderDetail table because it is changes to that table that affect a customer's total purchases. The trigger recalculates the entire TotalSales figure for each customer whose order details have been updated; it does not add to or subtract from the TotalSales column, because of the risk of inconsistencies (which could be caused, for example, by the trigger being disabled for a period).

     CREATE TRIGGER trgSalesOrderDetailDenorm ON Sales.SalesOrderDetail
     AFTER INSERT, UPDATE, DELETE
     AS
     BEGIN
         SET NOCOUNT ON;

         UPDATE c SET
             TotalSales = (
                 SELECT ISNULL(SUM(sod.LineTotal), 0)
                 FROM Sales.SalesOrderDetail AS sod
                 INNER JOIN Sales.SalesOrderHeader AS soh
                     ON soh.SalesOrderID = sod.SalesOrderID
                 WHERE soh.CustomerID = c.CustomerID
             )
         FROM Sales.Customer AS c
         -- Only update customers whose order details have been updated.
         INNER JOIN Sales.SalesOrderHeader AS soh
             ON soh.CustomerID = c.CustomerID
         WHERE
             EXISTS (SELECT * FROM inserted AS i
                     WHERE i.SalesOrderID = soh.SalesOrderID)
             OR
             EXISTS (SELECT * FROM deleted AS d
                     WHERE d.SalesOrderID = soh.SalesOrderID);
     END

  7. Type and execute the dummy update statement again to test its performance after the implementation of the trigger:

     UPDATE Sales.SalesOrderDetail SET
         OrderQty = OrderQty
     WHERE SalesOrderDetailID BETWEEN 1 AND 1000;

    Note the cost of the update from the Execution Plan tab.

    Because the Execution Plan tab now displays two separate plans, one for the update statement itself and one for the update statement executed by the trigger, the new cost of the update is the sum of the costs of these two updates.

  8. Test the accuracy of the trigger by inserting, updating, and deleting rows in the Sales.SalesOrderDetail table and validating the changes to the TotalSales column in the Sales.Customer table for the affected customer or customers. A sketch of such a test follows.
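    For example, the following sketch compares the stored TotalSales value with a freshly recomputed sum for the customer who owns sales order 43659 (an arbitrary order; any existing SalesOrderID works). The two values should match, within the rounding of the MONEY data type:

     -- Bump one order detail row; the trigger should fire and update
     -- the customer's TotalSales value.
     UPDATE Sales.SalesOrderDetail SET
         OrderQty = OrderQty + 1
     WHERE SalesOrderID = 43659
         AND SalesOrderDetailID = 1;

     -- Compare the denormalized value with a recomputed total.
     -- (You can revert the change afterward by subtracting 1 again.)
     SELECT
         c.TotalSales
         ,(
             SELECT SUM(sod.LineTotal)
             FROM Sales.SalesOrderDetail AS sod
             INNER JOIN Sales.SalesOrderHeader AS soh
                 ON soh.SalesOrderID = sod.SalesOrderID
             WHERE soh.CustomerID = c.CustomerID
         ) AS RecomputedTotal
     FROM Sales.Customer AS c
     WHERE c.CustomerID = (SELECT CustomerID
                           FROM Sales.SalesOrderHeader
                           WHERE SalesOrderID = 43659);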

  9. To clean up after this lab, close all open query windows in SQL Server Management Studio, open a new query window, and execute the following SQL statements:

     USE AdventureWorks;

     ALTER TABLE Sales.SalesOrderDetail
         ENABLE TRIGGER iduSalesOrderDetail;

     IF (OBJECT_ID('Sales.Customer') IS NOT NULL)
         ALTER TABLE Sales.Customer
             ENABLE TRIGGER uCustomer;

     DROP TRIGGER trgSalesOrderDetailDenorm;

     ALTER TABLE Sales.Customer
         DROP COLUMN TotalSales;



