Section 5.6. The Double-Edged Sword of Partitioning

5.6. The Double-Edged Sword of Partitioning

Despite the fact that partitioning spreads data from a table over multiple, somewhat independent partitions, data-driven partitioning is not a panacea for resolving concurrency problems. For example, we might partition a table by date, having one partition per week of activity. Doing so is an efficient way to spread data for one year over 52 logically distinct areas. The problem is that during any given week everybody will rush to the same partition to insert new rows. Worse, if our partitioning key is the current system date and time, all concurrent sessions will be directed towards the very same data block (unless some structural implementation tricks have been introduced, such as maintaining several lists of pages or blocks where we can insert). As a result, we may have some very awkward memory contention. Our large table will become a predominantly cold area, with a very hot spot corresponding to most current data. Such partitioning is obviously less than ideal when many processes are inserting concurrently.

If all data is inserted through a single process, which is sometimes the case in data-warehousing environments, then we won't have a hot spot to contend with, and our 52-week partitioning scheme won't lead to concurrency problems.

On the other hand, let's assume that we choose to partition according to the geographical origin of a purchase order (we may have to carefully organize our partitioning if our products are more popular in some areas and suffer from heavier competition elsewhere). At any given moment, since sales are likely to come from nowhere in particular, our inserts will be more or less randomly spread over all our partitions. The performance impact from our partitioning will be quite noticeable when we are running geographical reports. Of course, because we have partitioned on spatial criteria, time-based reports will be less efficiently generated than if we had partitioned on time. Nevertheless, even time-based queries may, to some extent, benefit from partitioning since it is quite likely that on a multiprocessor box the various partitions will be searched in parallel and the subsequent results merged.

There are therefore two sides to partitioning. On the one hand, it is an excellent way to cluster data according to the partitioning key so as to achieve faster data retrieval. On the other hand, it is a no-less-excellent way to spread data during concurrent inserts so as to avoid hot spots in the table. These two objectives can work in opposition to one another, so the very first thing to consider when partitioning is to identify the major problem, and partition against that. But it is important to check that the gain on one side is not offset by the loss on the other. The ideal case is when the clustering of data for selects goes hand in hand with suitably spread inserts, but this is unfortunately not the most common situation.

Data partitioning can be used to scatter or cluster your data: it all depends on your requirements.