Section 5.8. The Best Way to Partition Data

5.8. The Best Way to Partition Data

Never forget that what dictates the choice of a nonstandard storage option such as partitioning is the global improvement of business operations. It may mean improving a business process that is perceived as being of paramount importance to the detriment of some other processes. For instance, it makes sense to optimize transactional processing that takes place during business hours at the expense of a nightly batch job that has ample time to complete. The opposite may also be true, and we may decide that we can afford to have very slightly less responsive transactions if it allows us to minimize a critical upload time during which data is unavailable to users. It's a matter of balance.

In general, you should avoid unduly favoring one process over another that needs to be run under similar conditions. In this regard, any type of storage that positions data at different locations based on the data value (for example both clustering indexes as well as partitioning) are very costly when that value is updated. What would have previously been an in situ update in a regular table, requiring hardly more than perhaps changing and shifting a few bytes in the table at an invariant physical address, becomes a delete on one part of the disk, followed by an insert somewhere else, with all the maintenance operations usually associated with indexes for this type of operation.

Having to move data when we update partition keys seems, on the surface, to be a situation best avoided. Strangely, however, partitioning on a key that is updated may sometimes be preferable to partitioning on a key that is immutable once it has been inserted. For example, suppose that we have a table being used as a service queue. Some process inserts service requests into this table that are of different types (say type T₁ to type T_n). New service requests are initially set to status W, meaning "waiting to be processed." Server processes S₁ to S_p regularly poll the table for requests with the W status, change the status of those requests to P (meaning "being processed"), and then, as each request is completed its status is set to D for "done."

Let's further suppose that we have as many server processes as we have request types, and that each server process is dedicated to handling a particular type of request. Figure 5-5 shows the service queue as well as the processes. Of course, since we cannot let the table fill with "done" requests, there must be some garbage-collecting process, not shown, that removes processed requests after a suitable delay.

Figure 5-5. A service queue

Each server process regularly executes a select (actually, a select ... for update) query with two criteria, the type, which depends on the server, and a condition:

     and status = 'W'

Let's consider alternative ways of partitioning the service queue table. One way to partition the table, and possibly the most obvious, is to partition by request type. There is a big advantage here should any server process crash or fall behind in one way or another. The queue will lengthen for that process until it finally catches up, but the interruption to the processing of that queue will have no influence on the other processes.

Another advantage of partitioning by request type is that we avoid having requests of any one type swamp the system. Without partitioning, the polling processes scan a queue that under normal circumstances contains very few rows of interest. If we have a common waiting line and all of a sudden we have a large number of requests of one type and status, all the processes will have more requests to inspect and therefore each will be slowed down. If we partition by type, we establish a watertight wall between the processing of different types.

But there is another possible way to partition our service queue table, and that is by status. The downside is obvious: any status change will make a request migrate from one partition to the next. Can there be any advantage to such migration? Actually, there may indeed be benefit in this approach. Everything in partition W is ready and waiting to be processed. So there is no need to scan over requests being processed by another server or requests that have already been processed. Therefore, the cost of polling may be significantly reduced. Another advantage is that garbage collection will operate on a separate partition, and will not disturb the servers.

We cannot say definitively that "partitioning must be by type" or "partitioning must be by status." It depends on how many servers we have, their polling frequency, the relative rate at which data arrives, the processing time for each type of request, and how often we remove processed requests, and so on. We must carefully test various hypotheses and consider the overall picture. But it is sometimes more efficient for the overall system to sacrifice outright performance for one particular operation, if by doing so other, more frequently running processes are able to obtain a net advantage, thus benefiting the global business operations.

There may be several ways to partition tables, and the most obvious is not always the most efficient. Always consider the global picture.