Partitioned Views | Microsoft SQL Server 2005 Stored Procedure Programming in T-SQL & .NET

Views can be a very useful tool for managing very large databases (VLDBs). Typically, data warehouse systems contain huge volumes of uniform data. A textbook example is a retailer that collects information about sales over years. Some analyses would process many years of data, but others would focus on only a few months or the current year. If everything were in a single table, queries and management of data would become increasingly difficult. In such a scenario, the retailer's sales information would be split into several horizontally partitioned tables such as Orderltem2005, Orderltem2006, and Orderltem2007. For analyses (queries) that span all tables, you can create a view that puts them all together:

      Create View dbo.vOrderItem      as      select * from dbo.0rderItem2005      UNION ALL      select * from dbo.0rderItem2006      UNION ALL      select * from dbo.0rderItem2007

Horizontal and Vertical Partitioning

Views based on multiple instances of the same table are called partitioned views. A horizontal partitioning occurs when different subsets of records are stored in different table instances (as in the preceding example).

It is also possible to do vertical partitioning—to put columns in separate tables based on the frequency with which they are needed. On "wide" tables, each record occupies a substantial amount of space. Since each data page is limited to 8KB, a smaller number of records can fit onto a single data page. As a result, the number of IO operations needed to access a large number of records is much higher. To reduce it, we can put frequently used fields in one table and other fields in a second table. The tables will have a one-to-one relationship. In the following example, the InventorySum table has been split into InventoryPrim and InventorySec tables:

      CREATE TABLE [dbo] . [InventoryPrim] (         [Inventoryid]     [int]       NOT NULL ,         [Make]            [varchar]   (50) NULL ,         [Model]           [varchar]   (50) NULL ,         [Location]        [varchar]   (50) NULL ,         [FirstName]       [varchar]   (30) NULL ,         [LastName]        [varchar]   (30) NULL ,         [UserName]        [varchar]   (50) NULL ,         [EqType]          [varchar]   (50) NULL ,         CONSTRAINT [PK_InventroyPrim] PRIMARY KEY CLUSTERED         (            [Inventoryid]         )  ON [PRIMARY]      ) ON [PRIMARY]      GO     CREATE TABLE [dbo] . [InventorySec] (        [Inventoryid]     [int]        NOT NULL ,        [AcquisitionType]   [varchar]         (12) NULL ,        [Address]           [varchar]         (50) NULL ,        [City]              [varchar]         (50) NULL ,        [ProvinceId]        [char]            (3)  NULL ,        [Country]           [varchar]         (50) NULL ,        [EqType]            [varchar]         (50) NULL ,        [Phone]             [typPhone]             NULL ,        [Fax]               [typPhone]             NULL ,        [Email]             [typ Email]            NULL ,        CONSTRAINT [PK_InventorySec] PRIMARY KEY CLUSTERED        (           [Inventroyid]         )   ON [PRIMARY]      ) ON [PRIMARY]      GO

The following creates a view that joins them:

      create view dbo.vInveotyVertSplit      as      select IP.Inventoryid, IP.Make,       IP.Model,             IP.Location,    IP.FirstName,  IP.LastName,             IP.UserName,    IP.EqType,     ISec.AcquisitionType,             ISec.Address,   ISec.City,     ISec.ProvinceId,             ISec.Country,   ISec.Phone,    ISec.Fax,             ISec.Email      from dbo.InventoryPrim IP      full join dbo.InventorySec ISec      on IP.Inventoryid = ISec.Inventoryid

The following creates a query that references only fields from one table:

      SET STATISTICS PROFILE ON      SELECT Make, Model, Location, UserName      FROM dbo.vInventoryVertSplit      where Inventoryid = 1041

In this case, SQL Server realizes that there is no need to access both tables:

      Make      Model                       Location       UserName      --------  --------------------------- -------------- ---------      Compaq    18.2GB 10K RPM Ultra2 Disk Dr Royal Hotel  PMisiaszek      (1 row(s) affected)      Rows Executes StmtText      ---- -------- -------------------------------------------------      1    1        SELECT [Make]=[Make],[Model]=[Model],[Location]=[Locatio      1    1         |--Clustered Index Seek(OBJ:(Assets.dbo.InventoryPrim.P      (2 row(s) affected)

Unfortunately, this is not always the case. If we remove criteria from the previous Select statement, SQL Server accesses both tables (although the columns selected are in only one table):

      Rows Executes StmtText      ---- -------- -------------------------------------------------------      980  1        SELECT [Make] = [Make] , [Model] = [Model] , [Location] = [Locatio      980  1         |--Merge Join(Full Outer Join, MERGE:(InventorySec.Inve      980  1           |--Clustered Index Scan(OBJ:(Assets.dbo.InventorySec.      980  1           |--Clustered Index Scan(OBJ:(Assets.dbo.InventoryPrim

Note

You can always force SQL Server to use just one table if you reference the table and not tbe view.

Distributed Partitioned Views

If all base tables of a view are stored on a single server, it is called a local partitioned view. If the underlying tables of a view are stored on separate servers, it is called a distributed partitioned view. Distributed partitioned views are always created on tables that are horizontally partitioned. In the following example, the vSales view on server Alpha references tables on servers Beta and Gamma:

      Create view dbo.vSales      as      select * from Sales.dbo.0rderItem2007      UNION ALL      select * from Beta.Sales.dbo.0rderItem2006      UNION ALL      select * from Gamma.Sales.dbo.OrderItem2005

That's the basic idea, but it is not that simple. I will discuss all the details and then show a complete example.

Servers that host horizontally partitioned tables and that work together are called federated servers. This technology was one of the major new features of SQL Server 2000 and it has allowed Microsoft to beat the competition consistently on TPC-C benchmarks since it became available in the Beta version of SQL Server 2000.

The strategy of splitting the transaction and query load among a set of distributed servers is often called scaling-out (as opposed to scaling-up, which refers to the brute-force method of simply applying bigger and faster hardware instead).

Partitioned tables are split based on a partitioning key—a column that determines which of the partitioned tables/federated servers the record will fall into. In the previous example, the partitioning key is a year. A partitioning key should be selected to ensure that the majority of queries are served from a single table/server. The success of a federated server project depends largely on the selection of an appropriate partitioning key.

Note

You do not have to have multiple physical servers to test federated servers. You can install several instances of SQL Server on the same machine to develop and test the solution. Naturally, it would be pointless to implement federated servers that way in a production environment.

By way of example, assume that your Asset database is serving a Canadian company and that it is functionally divided into three divisions—one serves the Canadian market, the second the U.S. market, while the third serves the international market (see Figure 8-3). This schema is very good when reporting is typically done per division.

image from book
Figure 8-3: Federated servers

You will partition the table using the Country column as a partitioning key. To assist the resolution of the distributed partitioned view, make Country the first field of the primary key and create Check constraints to prevent entry of records from an incorrect geographic location. It is very important that data ranges do not overlap and that a single record can end up only on a single server. The following Create Table statements should be executed on the respective servers:

      --on Canada server      CREATE TABLE [dbo].[InventoryCanada] (          [Inventoryid] [int] NOT NULL ,          [Make] [varchar] (50) NULL ,          [Model] [varchar] (50) NULL ,          [Location] [varchar] (50) NULL ,          [FirstName] [varchar] (30) NULL ,          [LastName] [varchar] (30) NULL ,          [AcquisitionType] [varchar] (12) NULL ,          [Address] [varchar] (50) NULL ,          [City] [varchar] (50) NULL ,          [ProvinceId] [char] (3) NULL ,          [Country] [varchar] (50) NOT NULL ,          [EqType] [varchar] (50) NULL ,          [Phone] [typPhone] NULL ,          [Fax] [typPhone] NULL ,          [Email] [typEmail] NULL ,          [UserName] [varchar] (50) NULL ,          CONSTRAINT [PK_InventoryCanada] PRIMARY KEY CLUSTERED           (             [Country],             [Inventoryid]           ) ON [PRIMARY] ,           CONSTRAINT [chkInventoryCanada] CHECK ([Country] = 'Canada')      ) ON [PRIMARY]      GO      -----------------------------------------------------------------      -- on US server      CREATE TABLE [dbo].[InventoryUSA] (          [Inventoryid] [int] NOT NULL ,          [Make] [varchar] (50) NULL ,          [Model] [varchar] (50) NULL ,          [Location] [varchar] (50) NULL ,          [FirstName] [varchar] (30) NULL ,          [LastName] [varchar] (30) NULL ,          [AcquisitionType] [varchar] (12) NULL ,          [Address] [varchar] (50) NULL ,          [City] [varchar] (50) NULL ,          [ProvinceId] [char] (3) NULL ,          [Country] [varchar] (50) NOT NULL ,          [EqType] [varchar] (50) NULL ,          [Phone] [typPhone] NULL ,          [Fax] [typPhone] NULL ,          [Email] [typEmail] NULL ,          [UserName] [varchar] (50) NULL ,          CONSTRAINT [PK_InventoryUS] PRIMARY KEY CLUSTERED          (             [Country],             [Inventoryid]           } ON [PRIMARY] ,           CONSTRAINT [chkInventoryUSA] CHECK ([Country] = 'USA')      ) ON [PRIMARY]      GO      ------------------------------------------------------------------      -- on World server      CREATE TABLE [dbo].[InventoryWorld] (          [Inventoryid] [int] NOT NULL ,          [Make] [varchar] (50) NULL ,          [Model] [varchar] (50) NULL ,          [Location] [varchar] (50) NULL ,          [FirstName] [varchar] (30) NULL ,          [LastName] [varchar] (30) NULL ,          [AcquisitionType] [varchar] (12) NULL ,          [Address] [varchar] (50) NULL ,          [City] [varchar] (50) NULL ,          [ProvinceId] [char] (3) NULL ,          [Country] [varchar] (50) NOT NULL ,          [EqType] [varchar] (50) NULL ,          [Phone] [typPhone] NULL ,          [Fax] [typPhone] NULL ,          [Email] [typEmail] NULL ,          [UserName] [varchar] (50) NULL ,          CONSTRAINT [PK_InventoryWorld] PRIMARY KEY CLUSTERED          (             [Country],             [Inventoryid]          ) ON [PRIMARY] ,          CONSTRAINT [chkInventoryWorld] CHECK ([Country] in ('UK',                     'Ireland', 'Australia'))      ) ON [PRIMARY]      GO

Create linked servers that reference all other servers that will participate in the distributed partitioned view on each server. In the current example, on server Canada, you need to create linked servers that reference the USA and World servers; on server USA, create linked servers that reference the Canada and World servers; and on the World server, create linked servers that reference the Canada and USA servers.

      exec sp_addlinkedserver N'(local)\USA', N'SQL Server'      GO      exec sp_addlinkedserver N'(local)\WORLD', N'SQL Server'      GO

Note

As you can see, I am running these statements against three instances of SQL Server running on the same physical machine.

To achieve better performance, it is necessary to set each linked server with the Lazy Schema Validation option. In the current example, on server Canada, you should execute

      USE master      EXEC sp_serveroption ' (local)\USA', 'lazy schema validation', 'true'      EXEC sp_serveroption '(local)\World', 'lazy schema validation', 'true'

Other servers should be set with the option for their linked servers. After that, the partitioned view will request metadata that describes the underlying table only if it is really needed.

Create distributed partitioned views that reference the local table and two tables on remote servers. On server Canada, you should execute

      use Asset5      GO      Create view dbo.vInventoryDist      as      select * from Asset5.dbo.InventoryCanada      UNION ALL      select * from [(local)\USA].Asset5.dbo.InventoryUSA      UNION ALL      select * from [(local)\World].Asset5.dbo.InventoryWorld      GO

Now, you can test the distributed partitioned view. Figure 8-4 shows the query that calls all three servers. The highlighted part of the result shows that the view is redirecting parts of the query to other servers. Figure 8-5 shows execution of the same query against database objects.

image from book
Figure 8-4: Usage of distributed partitioned view

image from book
Figure 8-5: Distributed partitioned view connects to tables on all member servers.

Note

Complete code for the creation and linking of all three databases on all three servers used in this example can he downloaded from the Trigon Blue web site: http://www.trigonhlue.com.

Execution Plans of Distributed Partitioned Views

If the query contains a criterion based on the partitioning key, SQL Server evaluates which servers contain matching data and executes the query against them only:

      set statistics Profile ON      select * from dbo.vInventoryDist      where Country = 'UK'      Rows Executes StmtText      ---- -------- -------------------------------------------------------      154         1 SELECT * FROM [dbo].[vInventoryDist] WHERE [Country]=@1      154         1 |--Compute Scalar(DEFINE:([InventoryWorld].[Inventoryid      154  1 |--Clustered Idx Seek(OBJ:([Assets].[dbo].[InventoryWo

The profile shows that the query was executed on the local server. In a case in which the data resides on another server, the profile would look like this:

      set statistics Profile ON      select * from vInventoryDist      where Country = 'CANADA'      Rows Executes StmtText      ---- -------- -------------------------------------------------------      872          1 SELECT * FROM [dbo]. [vInventoryDist] WHERE [Country]=@1      872          1    |--Compute Scalar(DEFINE:([(local)\CANADA].[Assets].[d 1      872          1         |--Remote Query(SOURCE:((local)\CANADA), QUERY:(SE

Figure 8-6 shows how the view will route the query to the remote server.

image from book
Figure 8-6: Distributed partitioned view routes the query to the remote server.

It is necessary to create partitioned views on two other servers with identical names. In that way, an application can get the data through the view on any of the servers. The views will reroute the query to the server that contains the data that is needed (see Figure 8-7).

image from book
Figure 8-7: Data-dependent routing

The system will achieve better performance if the partitioned view does not have to perform query routing—that is, if the application knows which server contains the data needed and therefore sends the query to the appropriate server. This technique is often called data-dependent routing.

Note

If the application is that intelligent, you might wonder why you need distributed partitioned views. Well, not all queries can he served from a single server. Some queries require data that is located on more than one server, and a distributed partitioned view would give you access to it.

The selection of a partitioning key and implementation of a Check constraint have a critical impact on the performance of the system. You would have seen this fact demonstrated had you implemented the partitioning key using the nonoptimizable argument (an argument that leads the optimizer to create an execution plan that will ignore indexes):

      . . .      CONSTRAINT [chkInventoryWorld] CHECK ([Country] <>'USA'                                       and [Country] <>'Canada'))      ) ON [PRIMARY]

In such a case, SQL Server cannot determine where data is located and the query will always be routed to the World server as well:

      Rows Executes StmtText      ---- -------- -----------------------------------------------------      872          1 SELECT * [dbo].FROM [vInventoryDist] WHERE [Country]=@1      872          1  |--Concatenation      872          1    |--Remote Query(SOURCE:((local)\CANADA), QUERY:(SELECT      0            1      |--Clustered Idx Seek(OBJ:(Asset.dbo.InventoryWorld.P

As you can see, the query was executed unnecessarily on one of the servers—no records were returned. SQL Server compares Check constraints with the partition key ranges specified in the Where clause and builds the execution plan accordingly.

You might think that SQL Server won't do such a good job when stored procedures are used against distributed partitioned views. It is true that SQL Server does not know which parameter will be specified in the stored procedure, and therefore it creates an execution plan that runs the query against all servers. However, the plan will have dynamic filters that serve as conditional logic and execute only the queries that are needed. To demonstrate, I will create a stored procedure that references the view:

      CREATE PROCEDURE dbo.ap_Inventory_ListDist         @chvCountry varchar(50)      AS      SELECT *      FROM dbo.vInventoryDist      WHERE Country = @chvCountry

I will now execute it so that you can review the profile:

      set statistics Profile ON      exec dbo.ap_Inventory_ListDist 'CANADA'

The execution plan will contain queries against all tables:

      Rows Executes StmtText      ---- -------- ------------------------------------------------------      872         1 select * from vInventoryDist where Country = OchvCoun      872         1   |--Concatenation      872         1   |    |--Clustered Index Seek(OBJECT:([Asset].[dbo].[      0           1   |--Filter(WHERE:(STARTUP EXPR([@chvCountry]='USA'}}}      0           0   |    |--Remote Query(SOURCE:(.\USA), QUERY:(SELECT C      0           1   |--Filter(WHERE:(STARTUP EXPR([@chvCountry]='Ireland      0           0        |--Remote Query(SOURCE:(.\World), QUERY:(SELECT

But two of these queries are not executed (as you can see in the Executes column).

Updateable Distributed Partitioned Views

Data can be modified through a distributed partitioned view:

      set xact_abort on      update vInventoryDist      set UserName = 'unknown'      where UserName is null      and Country = 'Canada'

I needed to set the Xact_Abort option because each such statement is treated as a distributed transaction. Therefore, Distributed Transaction Coordinator must be running on each server. The result will look like this:

      (2 row(s) affected)      Rows Executes StmtText      ---- -------- ----------------------------------------------      0           1 update vInventoryDist set UserName = 'unknown' where Use      0           1 |--Sequence      0           1   |--Remote Query(SOURCE:((local)\CANADA), QUERY:(UPDAT      0           1      |--Filter(WHERE:(STARTUP EXPR(0}}}      0           0      |  |--Remote Query(SOURCE:((local)\USA), QUERY:(UP      0           1      | --Clustered Index Update(OBJECT:(Asset.dbo.Invent      0           1          |--Filter(WHERE:(STARTUP EXPR(0}}}      0           0             | --Clustered Index Seek(OBJECT:(Asset.dbo.Inv

Note that SQL Server has again created dynamic filters, and only the appropriate queries will be executed, although they are all incorporated in the execution plan.

Unfortunately, views and modification statements have to satisfy additional requirements to allow modifications in that manner. I will mention only the most interesting and most restrictive ones:

Member tables on other servers must be referenced using four-part names, the OpenRowset function, or the OpenDataSource function. These functions must not use pass-through queries.
Member tables must not have triggers and cascading deletes or updates defined.
All columns of a member table must be included in the distributed partitioned view. The order of the columns must be identical.
The column definitions in all base tables must match (data type, scale, precision, and collation).
Ranges of partition key values in member tables must not overlap.
There can be only one Check constraint on the partitioning column and it may use only these operators: BETWEEN, AND, OR, <, <=, >, >=, =.
Tables cannot have identity values (otherwise, Insert statements will fail).
Partitioning keys cannot have defaults, allow nulls, be computed columns, or be timestamp values.
smallmoney and smalldatetime columns on remote tables are automatically converted to money and datetime columns. Since all data types must match, the local table must use money and datetime. To avoid confusion, it is best not to use smallmoney and smalldatetime.

It is sometimes possible to work around some of these rules—you can create an Instead-of trigger to modify the member tables directly. Unfortunately, in that case, query Optimizer might not be able to create an execution plan as good as the one that would be created for a view that follows all the rules.

Scalability and Performance of Distributed Systems

Federated servers and distributed partitioned views are not a magic bullet that will solve all your problems. Note that distributed partitioned views are primarily designed to improve scalability of the system, not its performance. Although these two parameters might seem similar to you, there is a significant difference. Performance refers to the speed of execution of the system (or of individual transactions), while scalability refers to the ability to increase transactional load or the number of concurrent users without significant performance degradation. For example, if a metric describing system performance is 100 percent on a single server, adding another server might cause performance to fall to, for example, 50 percent. In this case, end users would notice improvements only after a third server is added (3x50 percent). But the advantage is that we now have a system with nearly linear scalability—every additional server would increase performance by another 50 percent.

Federated servers (like other distributed database systems) are also more difficult to manage. Even "simple" operations such as backups and restores become very complicated. Promotion of hot fixes or new code in a production environment requires significant manual intervention or development of specialized tools.

It is very important to evaluate the pros and cons of this design before you start. A rule of thumb is that all other options should be explored and exhausted first, and only then should scaling-out be attempted. The game plan should be something like this:

Optimize the database and the application.
Scale up the server.
Scale out the system.

A Poor Man's Federated Server

It is possible to create a distributed system without the use of distributed partitioned views. For example, if you are in a business that has only a few, large customers (with a heavy transactional load), you could create a single database per customer instead of storing all transactions in the same database. Then, you could divide databases between several servers or install each one on a separate dedicated server. An application can be designed to direct each query or transaction to the appropriate server—to perform data-dependent routing.

A similar design would be applicable for an organization that can easily be partitioned into its suborganizations, based, for example, on geographic locations.

The key requirement is that there be no (or very little) need to aggregate data on the complete system—that is, to run queries (or transactions) that span multiple servers.