Designing for Disconnected Data

Before writing your code, you must make some key design decisions that will affect your ability to work with disconnected data: how much data should be loaded, how the data will be updated when many related tables are involved, and (probably the most important decision) which type of primary key you will implement. We will look at each of these in detail, using the following scenario.

Joe is a traveling salesperson who needs to maintain a list of customers and orders for his territory. While he is on the road, he can modify the data and store it to disk. When he returns to the office, he sends his changes to the main database server and retrieves updated information as well. If any concurrency conflicts arise, Joe is prompted to select the correct data. Our example uses a simplified order entry database that contains five related tables, as shown in Figure 5-1.

Figure 5-1: Simplified order entry database schema

What Data Should Be Loaded?

The following factors affect how much data you should load and how you should organize it.

Data Selection

You should load only as much data as the user needs to work disconnected. If the user is a traveling salesperson, you might need to load his customer list and all of the information related to each customer. If you are building a Web site that uses disconnected data, you might retrieve only the data for the currently logged-on customer. In almost all cases, you need to retrieve only a subset of the data in the database.
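As a rough sketch of such a filtered load (the SalesPersonId column and the connection string are assumptions for illustration, not part of the Figure 5-1 schema):

C#

using System.Data;
using System.Data.SqlClient;

static class CustomerLoader
{
    // Retrieve only the rows one salesperson needs while disconnected,
    // rather than the entire TblCustomer table.
    public static DataSet LoadCustomersFor(int salesPersonId, string connectionString)
    {
        DataSet ds = new DataSet("CustomerData");
        using (SqlConnection cn = new SqlConnection(connectionString))
        {
            SqlDataAdapter da = new SqlDataAdapter(
                "SELECT * FROM TblCustomer WHERE SalesPersonId = @SalesPersonId", cn);
            da.SelectCommand.Parameters.AddWithValue("@SalesPersonId", salesPersonId);
            da.Fill(ds, "TblCustomer");   // Fill opens and closes the connection
        }
        return ds;
    }
}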

Data Size

The size of the data affects the load time, the update time, and the client's memory requirements. Keep in mind that DataSet objects are memory based, so be careful about the amount of data you retrieve. If you're not sure whether you should retrieve certain data, don't retrieve it. This is a much better approach than arbitrarily retrieving data that might never be used.

Partitioning Data

It might be beneficial to break the data into multiple DataSet objects, based on what you think the DataSet object should represent. For example, you might think of a DataSet object as an object that contains all of the data for a single customer. In that case, the DataSet object might contain all five of the tables defined in Figure 5-1, but the TblCustomer and TblAddress tables contain only a single row with the information for a particular customer, and the TblOrder and TblOrderItem tables only have rows that relate to that customer.

What do you do with the TblProduct table? If the DataSet object is supposed to represent a complete snapshot of this customer at a point in time, you might want to include the TblProduct table but include only the product rows that relate to the TblOrderItem table. This would allow you to view all of the order information, including the products purchased. However, if you are going to allow products to be deleted from the TblProduct table, even if they have been used in orders, TblOrderItem should contain all of the data for the product being purchased, and the TblProduct table can be excluded from the customer DataSet object.

Similarly, if you need to be able to add more orders for different products, you must have the complete TblProduct table, so again you might want to place it into its own DataSet object. This would allow you to transfer product lists independently of the customers. Remember that a foreign key constraint cannot be created between DataTable objects that are in different DataSet objects; in this case, however, that limitation works in your favor, because you want to be able to delete obsolete products from the TblProduct table without being forced to delete references to them from the TblOrderItem table.

Therefore, the TblProduct table should be in its own DataSet object, and there should either be a DataSet object for each customer or one DataSet object that contains all of the customer data. In the scenario where Joe is a traveling salesperson, one DataSet object will contain the data for all of Joe's customers, and the second DataSet object will contain the products, as shown in Figure 5-2.

Figure 5-2: The customer data and product data should be in separate DataSet objects.
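As a rough sketch of this partitioning (using the table names from Figure 5-1; in practice each DataTable would be filled by its own data adapter), the two DataSet objects might be built like this:

C#

using System.Data;

// Two DataSet objects: the customer data travels with the salesperson,
// while the product list can be transferred independently.
DataSet customerDs = new DataSet("CustomerData");
customerDs.Tables.Add("TblCustomer");
customerDs.Tables.Add("TblAddress");
customerDs.Tables.Add("TblOrder");
customerDs.Tables.Add("TblOrderItem");

DataSet productDs = new DataSet("ProductData");
productDs.Tables.Add("TblProduct");

// A ForeignKeyConstraint cannot span two DataSet objects, which is exactly
// what allows obsolete products to be deleted from TblProduct without
// forcing deletes of the TblOrderItem rows that reference them.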

Choosing the Primary Key

The primary key is the column or combination of columns that uniquely defines a row. No column in the primary key can contain a null value. The primary key provides a means for the database engine to retrieve a specific row from a table in your database. The primary key is also used to enforce referential integrity. When you are working with disconnected data, you must eliminate any possibility of duplicate primary keys when multiple users are inserting data at the same time.

Intelligent, Natural Primary Keys vs. Surrogate Primary Keys

An intelligent key is a key that is based on the business data being represented. An example of an intelligent key is a Stock Keeping Unit (SKU) that is defined as a 10-character field (defined as CHAR(10) in the database). This SKU might be built as follows: the first four characters are the vendor code, the next three characters are a product type code, and the last three characters are a sequence number.

A natural key is a column or combination of columns that naturally exists in the business data and is chosen to uniquely identify records. For instance, an existing business process might use a social security number to identify a hospital patient.

Although intelligent and natural keys are different, they are both created from business-related columns that are normally viewed by the user, and the arguments for and against them are the same for the purposes of this discussion. I will refer to these keys collectively as Intelligent-Keys.

Surrogate primary keys are system-generated key values that have no relationship to the business data in the row, which makes them dumb keys. I'll refer to them generically as surrogate keys. One example of this kind of key is an auto-increment column where the value is set to 1, 2, 3, and so on as new rows are added. Auto-increment columns in Microsoft SQL Server are referred to as identity columns. I will refer to this key as an Identity-Key. Another example of system-generated key values is the use of a globally unique identifier (GUID) that is set to a value that is guaranteed to be unique based on the algorithm that creates these values. I will refer to this as a GUID-Key.

Which type of primary key do the experts recommend? It depends on which expert you ask. Each approach has major advantages and disadvantages that you should understand before you make your choice.

Figure 5-3 shows an example of the Intelligent-Key and surrogate primary key implementations. It contains a table for authors, a table for books, and a join (many-to-many) table, because an author can write multiple books and a book can be written by multiple authors.

Figure 5-3: Example of intelligent and surrogate primary key implementations
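To make the difference concrete, here is a hedged sketch of the author table in both styles. The column sizes follow Table 5-1, but the exact definitions in the book's sample database may differ, and only one of the two variants would exist in a real schema:

SQL

-- Intelligent-Key implementation: the business value is the primary key.
CREATE TABLE TblAuthor (
    SSN       CHAR(9)     NOT NULL PRIMARY KEY,
    LastName  VARCHAR(50) NOT NULL,
    FirstName VARCHAR(50) NOT NULL
)

-- Surrogate implementation (identity variant): a system-generated Id is the
-- primary key; a unique constraint still enforces the business value.
CREATE TABLE TblAuthor (
    Id        INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SSN       CHAR(9)           NOT NULL UNIQUE,
    LastName  VARCHAR(50)       NOT NULL,
    FirstName VARCHAR(50)       NOT NULL
)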

Notice that the surrogate primary key implementation contains an extra Id column in the author and book tables because surrogate primary keys are system generated and have no relation to the row data. Surrogate primary keys should never be visible to the user. Here are the differences between the Intelligent-Key and surrogate key implementations.

Data Size Size itself is not too important, but the bandwidth that data consumes when it is transferred between the database and client is important. The surrogate implementation adds a column to each primary table, which can substantially increase data size. If the surrogate column is a GUID, the added column is 16 bytes per row; if it is an auto-increment column, its size is based on the numeric data type that you select (4 bytes for int, 8 bytes for long). It's common to see join tables that contain a very large quantity of rows, so be sure to consider the size difference between intelligent and surrogate keys when you analyze the overall database size difference. Also, the primary key enforces uniqueness by creating a unique index, so be sure to consider this as well. Table 5-1 compares the sizes that result from choosing each key type.

Table 5-1: Example Primary Key Sizes

Description          Intelligent-Key                 Identity-Key (int)                      GUID-Key
1000 Authors         9 bytes/SSN = 9,000 bytes       4 bytes/int = 4,000 bytes               16 bytes/GUID = 16,000 bytes
3000 Books           10 bytes/ISBN = 30,000 bytes    4 bytes/int = 12,000 bytes              16 bytes/GUID = 48,000 bytes
10,000 AuthorBooks   19 bytes/key = 190,000 bytes    8 bytes/key = 80,000 bytes              32 bytes/key = 320,000 bytes
Subtotal             229,000 bytes                   96,000 bytes                            384,000 bytes
Index                229,000 bytes                   96,000 + 9,000 (SSN) + 30,000 (ISBN)    384,000 + 9,000 (SSN) + 30,000 (ISBN)
                                                     = 135,000 bytes                         = 423,000 bytes
Total Size           458,000 bytes                   231,000 bytes                           807,000 bytes

The apparent winner in this scenario is the Identity-Key, but remember that the maximum value of an int is 2^31 - 1 = 2,147,483,647. This should be enough for most applications, but you might need to use a long data type for large row quantities. Notice the size calculations in the Index row: a surrogate primary key implementation still requires a unique index on the SSN and ISBN columns to enforce uniqueness on these columns.

Key Visibility Surrogate keys are not intended to be seen by the user, whereas intelligent keys are seen and understood by the user. Your custom applications can hide surrogate keys, but database tools cannot. This means that people who use database tools must understand the use of surrogate keys. The Intelligent-Key is therefore the winner in this category.

Modifying Keys Primary keys are difficult to change because if you modify the key, the change must be propagated to the child tables. This is where the surrogate key shines and the intelligent key suffers. Why? Surrogate keys are not intended to be displayed to the user, so there is never a need to modify them. Intelligent keys consist of business data that is visible to the user, so you must always allow this data to change.

Surrogate int primary keys also need to change to ensure uniqueness (as described shortly), but surrogate GUID primary keys never need to change. This is the primary reason why I like to use surrogate GUID keys.

Quantity of Joins In some cases, you can reduce the number of joins with intelligent keys. For example, if you want to run a report showing the books by each author and containing just the author's SSN and the book's ISBN fields, you can simply query the TblAuthorBook join table when intelligent primary keys are implemented. When surrogate primary keys are implemented, you have to join the TblAuthor, TblAuthorBook, and TblBook tables to get this information. The intelligent primary keys win in this category, but consider how seldom you need just these two columns without also needing additional information such as the author's name or book title.
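As a brief illustration of the difference, using the Figure 5-3 tables (the AuthorId and BookId foreign key column names are assumed):

SQL

-- Intelligent keys: the join table already carries the business values.
SELECT SSN, ISBN
FROM TblAuthorBook

-- Surrogate keys: the same report needs two joins to recover those values.
SELECT a.SSN, b.ISBN
FROM TblAuthorBook ab
    INNER JOIN TblAuthor a ON a.Id = ab.AuthorId
    INNER JOIN TblBook b ON b.Id = ab.BookId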

SQL Complexity Intelligent primary keys are often implemented using multiple columns to achieve uniqueness. SQL queries can be much more complicated if such a compound intelligent key is involved. Although you might have more joins with surrogate keys, as described previously, surrogate keys can be easier to work with because they don't use compound keys (except possibly on join tables). If you compare the two surrogate key types, the Identity-Key implementation is easier to write queries for than the GUID-Key, but once you get familiar with some of the tricks of working with GUID data types (as described later), you'll find that the GUID-Key is only slightly more difficult to work with than the Identity-Key.

Ensuring Uniqueness When Disconnected It's essentially impossible to ensure uniqueness with intelligent keys when the user is entering data while disconnected. The problem is that someone else could enter matching information, resulting in a conflict when you attempt to send the added rows to the database server. One might argue that using surrogate keys can mask the problem, but don't forget that you still have the ability to create unique indexes on fields such as social security number or vehicle identification number, which can throw an exception if duplicate entries are added.
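For example, a unique index on the business column still lets the server reject duplicates at update time, even though the surrogate Id is the primary key (SQL Server syntax assumed):

SQL

-- The surrogate Id remains the primary key, but SSN must still be unique.
CREATE UNIQUE INDEX IX_TblAuthor_SSN ON TblAuthor (SSN)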

When surrogate int keys are used, the trick to managing the numbering of the primary key columns is to set the key column's AutoIncrement property in the disconnected DataSet object to true, its AutoIncrementStep (increment) to -1 (negative one), and its AutoIncrementSeed (starting value) to -1. New rows are then added starting with a value of -1 and counting down. These negative values are considered disconnected placeholders, and there is no chance of conflict with the server's identity column settings because the server assigns only positive numbers. The following SQL command shows the insertion, followed immediately by a query for the inserted row; the information that is returned is used to update the placeholders (negative keys) with the value that the database created.

SQL Insert Command

INSERT INTO [TblAuthor] ([SSN], [LastName], [FirstName])
    VALUES (@SSN, @LastName, @FirstName)

SELECT Id, SSN, LastName, FirstName FROM TblAuthor
    WHERE (Id = SCOPE_IDENTITY())

The SCOPE_IDENTITY function returns the value of the author's Id that was just inserted. Be careful not to use the @@IDENTITY function because this function returns an incorrect value if an insert trigger was fired and it inserted one or more rows into a table with an identity column.
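On the client side, a minimal sketch of the AutoIncrement configuration described above might look like this (the table and column names assume the TblAuthor schema):

C#

using System.Data;

// Configure the disconnected Id column to hand out negative placeholder
// values (-1, -2, -3, ...) that cannot collide with the positive identity
// values assigned by the server.
DataTable authors = new DataTable("TblAuthor");
DataColumn id = authors.Columns.Add("Id", typeof(int));
id.AutoIncrement = true;
id.AutoIncrementSeed = -1;   // first disconnected row gets -1
id.AutoIncrementStep = -1;   // subsequent rows get -2, -3, ...
authors.PrimaryKey = new DataColumn[] { id };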

Because the data must be retrieved from the server to update the placeholders in the disconnected data, you must consider the performance impact of updating the primary key values. What happens if you update the disconnected data key with the value that was created at the server? All of the child data must be updated to reflect the change as well; you can do this by enabling cascading updates on the relationships. This creates another performance hit, especially for large DataSet objects.
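A minimal sketch of enabling cascading updates, assuming a DataSet ds that already contains the TblAuthor and TblAuthorBook tables with an Id/AuthorId column pair:

C#

using System.Data;

// When the server-assigned key replaces the negative placeholder in
// TblAuthor, cascade the new value into TblAuthorBook automatically.
// (Cascade is the default UpdateRule; it is set explicitly here for clarity.)
ForeignKeyConstraint fk = new ForeignKeyConstraint(
    "FK_TblAuthorBook_TblAuthor",
    ds.Tables["TblAuthor"].Columns["Id"],
    ds.Tables["TblAuthorBook"].Columns["AuthorId"]);
fk.UpdateRule = Rule.Cascade;
ds.Tables["TblAuthorBook"].Constraints.Add(fk);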

When surrogate GUID primary keys are used, there is no need to change the key once it has been set. The main problem is setting the value. The following code snippet shows how to initialize the GUID.

Visual Basic

Private Sub Form1_Load(ByVal sender As System.Object, _
      ByVal e As System.EventArgs) Handles MyBase.Load
   ' Wire every table that has an Id column to the GUID initializer.
   For Each dt As DataTable In salesSurrogateGuidKeyDs.Tables
      If (Not dt.Columns("Id") Is Nothing) Then
         AddHandler dt.TableNewRow, AddressOf InitializeGuid
      End If
   Next
End Sub

Private Sub InitializeGuid(ByVal sender As Object, _
      ByVal e As DataTableNewRowEventArgs)
   ' Assign a new GUID only if the key has not already been set.
   If (TypeOf e.Row("Id") Is DBNull) Then
      e.Row("Id") = Guid.NewGuid()
   End If
End Sub

C#

public Form1() // constructor
{
   InitializeComponent();
   // Wire every table that has an Id column to the GUID initializer.
   foreach (DataTable dt in sales_SurrogateGuidKeyDs.Tables)
   {
      if (dt.Columns["Id"] != null)
         dt.TableNewRow += new DataTableNewRowEventHandler(InitializeGuid);
   }
}

private void InitializeGuid(object sender, DataTableNewRowEventArgs e)
{
   // Assign a new GUID only if the key has not already been set.
   if (e.Row["Id"] is DBNull)
      e.Row["Id"] = Guid.NewGuid();
}

Because TblAuthor and TblBook have primary keys with the same name ("Id"), a single InitializeGuid method creates the new GUIDs, and the TableNewRow event of each of these tables is wired to call it. It's usually a good idea to give all surrogate primary keys the same name, as I did here. There is no need to create cascading relationships with this implementation, which is why surrogate GUID primary keys win in this category.

Migrating Data to Other Databases Migrating data from one database to another when the Identity-Key is implemented requires some work. Imagine that your tables have keys with values 1 through n and these values are also placed in foreign keys throughout the database. You want to take that data and merge it into a database with data that uses the same numbers. To solve this problem, you need to renumber all of the Identity columns.

With the GUID-Key implementation, migrating data is a simple matter of copying the data from one database to the other, which means the GUID-Key wins in this category.

And the Winner Is

I just finished a very large project where the GUID-Key was implemented, and I have also worked on large projects using Intelligent-Key and Identity-Key implementations. Table 5-2 summarizes how each approach rates in a number of categories that are important in terms of performance, size, and ease of use. Based on my experience with these approaches, I assigned each category a weight from 0 to 100 percent, where 100 percent is most important. Next I assigned a first-place (1), second-place (0.5), or third-place (0) score to each key type and multiplied that score by the weight to obtain a weighted score for each item.

Table 5-2: Final Scores Based on Categories and Weights

Category and Weight          Intelligent-Key   Identity-Key   GUID-Key
Data Size = 25%              12.5%             25%            0%
Key Visibility = 5%          5%                2.5%           0%
Modifying Keys = 20%         0%                10%            20%
Quantity of Joins = 5%       5%                2.5%           2.5%
SQL Complexity = 5%          0%                5%             2.5%
Ensuring Uniqueness = 25%    0%                12.5%          25%
Migration = 15%              7.5%              0%             15%
Total = 100%                 30%               57.5%          65%

Based on my weighting, the GUID-Key best satisfies the greatest number of the most important categories. If you feel differently about any of the items, try modifying the weights to see whether you get a different result. Note that none of these primary key implementations earns a perfect 100 percent rating. My general feeling is that the GUID-Key implementation is the best approach for disconnected data applications. Is there a place for the Intelligent-Key implementation? Yes: it might be the best approach for data warehouse applications, which are typically designed to provide high-performance, read-only access with a minimum of joins. Because the data is read-only, there is little concern about key modification. Table 5-3 shows the scores recalculated with weights adjusted for this scenario.

Table 5-3: Data Warehouse Scores Based on Categories and Weights

Category and Weight          Intelligent-Key   Identity-Key   GUID-Key
Data Size = 25%              12.5%             25%            0%
Key Visibility = 10%         10%               0%             0%
Modifying Keys = 5%          0%                2.5%           5%
Quantity of Joins = 25%      25%               0%             0%
SQL Complexity = 10%         10%               5%             5%
Ensuring Uniqueness = 5%     0%                0%             5%
Migration = 20%              10%               0%             20%
Total = 100%                 67.5%             32.5%          35%

Use these tables as guidelines, and be sure to consider any additional categories that your project may have.

 

