Hack 79. Identify Updates Uniquely

Sometimes you need a primary key for a table, but there are no suitable candidates. A system-generated identifier can do the job.

At times, it is useful to batch the updates you need to make to a central server. For example, say that during the day, you accumulate changes to the database on several "branch" servers. At the close of business, you send the updates, in batches, from each branch to the central server. All of the batches can be processed during a quiet period.

This means that the data on the central server is always a little out of date, but it can make a big difference to the performance of a system. A distributed system such as this does not rely on the central server being available all the time, so it is robust. The branches can continue working during the day, even if the central server or the connection to it fails. It should perform a good deal faster as well. If you rely on one central server it is likely to be the bottleneck that limits your maximum rate of transactions.

You can have daily updates sent from the central server back to the branches, which means that each branch can be certain that its data is never more than one day old.

Consider a batch update system where a central server holds account details (Table 10-6) and each branch server holds a buffer of updates (Table 10-7).

Table 10-6. The account table held on a central server
accId	balance
Dibble	$100
Benny	$100
Choo Choo	$100
TC	$100

Table 10-7. The batch table sent daily by each branch
whn	accId	credit
2006-07-14	Dibble	5
2006-07-14	Benny	+5
2006-07-14	Dibble	5
2006-07014	Choo Choo	+5

In a scheme such as this, it is important that you record only changes, not actual values. In this example, you can see additions and subtractions relative to the existing balance. If you attempted to record the new balance in a batch such as this, one branch batch might overwrite the change from another branch batch.

The batch table gets sent up to the central server. At the central server you might apply the changes to the balance table with a single UPDATE statement such as this one:

UPDATE account
 SET balance = balance + (SELECT SUM(credit) FROM batch
 WHERE batch.accId = account.accId)

This will be slow because it has to update every account, whether it is due an update or not. Much worse, it will set accounts to NULL if they don't show up in the batch table. That is because SUM(credit) will be NULL where there are no rows in the batch table and amount+NULL is NULL. You must do a more selective UPDATE based on only the accounts that need changing:

mysql> UPDATE account
 -> SET balance = balance + (SELECT SUM(credit) FROM batch
 -> WHERE batch.accId = account.accId)
 -> WHERE accId IN (SELECT accId FROM batch);
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3 Changed: 3 Warnings: 0

mysql> SELECT * FROM account;
+-----------+---------+
| accId | balance |
+-----------+---------+
| Dibble | 90 |
| Benny | 105 |
| Choo Choo | 105 |
| TC | 100 |
+-----------+---------+

10.3.1. Cope with Duplicate Batches

When all goes well the batches are sent from the branch servers to the central server once per day. But when things go wrong, all or part of a batch may be sent twice. This might happen due to a mistake at the branch or due to hardware or network problems. An automated script might send the batch twice if a server reboot occurred during, or just after, the file was transmitted.

To guard against this possibility, the central machine can keep a record of every transaction ever received. If it gets the same transaction twice it can simply mark the duplicate and ignore it.

To do this you need every transaction to have a unique identifier, but in this example there is not enough data in each transaction to uniquely identify it.

10.3.2. Create Transactions at the Branch

For the batch processing to be completed reliably you need to attach a unique identifier to every transaction. You could use a branch identifier together with an auto-generated sequence number, but there can be complications with this solution. It might be that some branches have sub-branches, for example.

A globally unique identifier is essentially a random number that is sufficiently long that there is no chance of the same value ever being repeated. These can be generated at the branch as each transaction is entered. The batch2 table includes space for the unique identifier:

CREATE TABLE batch2
(tranId CHAR(36)
,whn DATE
,accId VARCHAR(20)
,credit INT
);

The UUID (universally unique identifier) function in MySQL is nondeterministic; it will give a different value for every call. Two different machines cannot generate the same value.

mysql> INSERT INTO batch2 VALUES (UUID( ),DATE '2006-07-14','Dibble',-5);
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO batch2 VALUES (UUID( ),DATE '2006-07-14','Benny',5);
Query OK, 1 row affected (0.00 sec)
...
mysql> SELECT * FROM batch2;
+--------------------------------------+------------+-----------+--------+
| tranId | whn | accId | credit |
+--------------------------------------+------------+-----------+--------+
| 89cb1ef6-8c80-1029-8490-00e08125c735 | 2006-07-14 | Dibble | -5 |
| 89cba920-8c80-1029-8490-00e08125c735 | 2006-07-14 | Benny | 5 |
...

In SQL Server, you can use the NewId( ) function to obtain a unique, 36-character value; in Oracle, the 32-character SYS_GUID( ) is available. PostgreSQL did not have such a function as of this writing, but it is scheduled to be included in version 8.2.

10.3.3. Update at the Central Server

You can now send the batch2 table to the central server.

The central server needs to keep a table, batchHistory, to record all transactions that have been sent. This table will be big, but not too big, because you can afford to remove very old records. You need to allow for batches that are one or two days old in case of emergency, but you should reject any batch older than a specified cutoff period. The examples that follow assume you can afford to remove two-week-old records from the batchHistory table.

The central server should complete three operations in a single transaction:

Delete the duplicates from the batch.
Update the balance table.
Add all batch records to the batchHistory table.

Here's the processing that needs to be performed on the central server:

mysql> DELETE FROM batch2
 -> WHERE tranId IN (SELECT tranId FROM batchHistory);
Query OK, 0 rows affected (0.00 sec)
mysql> UPDATE account
 -> SET balance = balance +
 -> (SELECT SUM(credit)
 -> FROM batch2
 -> WHERE batch2.accId=account.accId)
 -> WHERE accId IN (SELECT accId FROM batch2);
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3 Changed: 3 Warnings: 0
mysql> INSERT INTO batchHistory
 -> SELECT * FROM batch2;
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM account;
+-----------+---------+
| accId | balance |
+-----------+---------+
| Dibble | 90 |
| Benny | 105 |
| Choo Choo | 105 |
| TC | 100 |
+-----------+---------+

The balances have been updated as before, but this time the batchHistory table has a copy of every row in the batch, so if the same sequence is attempted, no updates will get through:

mysql> DELETE FROM batch2
 -> WHERE tranId IN (SELECT tranId FROM batchHistory);
Query OK, 4 rows affected (0.00 sec)

mysql> UPDATE account
 -> SET balance = balance +
 -> (SELECT SUM(credit)
 -> FROM batch2
 -> WHERE batch2.accId=account.accId)
 -> WHERE accId IN (SELECT accId FROM batch2);
Query OK, 0 rows affected (0.00 sec)
Rows matched: 0 Changed: 0 Warnings: 0

mysql> INSERT INTO batchHistory
 -> SELECT * FROM batch2;
Query OK, 0 rows affected (0.00 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> SELECT * FROM account;
+-----------+---------+
| accId | balance |
+-----------+---------+
| Dibble | 90.00 |
| Benny | 105.00 |
| Choo Choo | 105.00 |
| TC | 100.00 |
+-----------+---------+

Notice that this time all rows in the batch were deleted, so no updates were made.

SQL Fundamentals

Joins, Unions, and Views

Text Handling

Date Handling

Number Crunching

Online Applications

Organizing Data

Storing Small Amounts of Data

Locking and Performance

Reporting

Users and Administration

Wider Access

Index