B-trees

   

Two types of searches are possible when you're retrieving data from your database. An internal search occurs when you're searching a file that fits into your computer's internal memory. When you need to retrieve data from a very large file, though, the search is called an external search because the entire file doesn't fit into the available RAM. B-tree structures lend themselves nicely to external searchingthey make it possible both to search and to update a large file with guaranteed efficiency.

All DBMSs depend on B-treesthey are the default structures built when you create an index and are also the structures that will be used to enhance retrieval in most cases. Each of the Big Eight use the B-tree. Informix alone says explicitly that it uses a variant called the B+-tree, but the rest of the Big Eight do too. None of the Big Eight use the other important variant, the B*-tree.

A B-tree is essentially just a sorted list of keys, which are {data, pointer or row locator} combinations. The actual structure of a B-tree consists of a number of nodes stored as items in the B-tree's file. Each node contains a portion of the sorted list of keys in the B-tree, along with pointers to other nodes. The root node for a given B-tree is where all searches through that tree begin. Figure 9-1 shows a simplified example of this principle.

Figure 9-1. Searching a B-tree

graphics/09fig01.gif

Characteristics of B-trees

A B-tree of order m is a tree with the following characteristics:

  • Every node has at most m children.

  • Every node, except for the root and the leaves , has at least (m/2) children.

  • The root has at least two children, unless it is a leaf.

  • All leaves appear on the same level and carry no information.

  • A nonleaf node with k children contains (k-1) keys.

  • A leaf is a terminal node (one with no children).

A B+-tree is an improved B-tree that allows you to efficiently search a file both sequentially and randomly . All records of the file are stored in the B+-tree's leaves, with only a few of the keys duplicated in the branch nodes. A B+-tree's leaves are always at least half fulla new key enters the nonleaf part of the tree whenever a leaf splits . B+-trees are B-trees with two common features: (a) all keys are in the leaf nodes even if that means they are repeated in more than one layer, and (b) there are pointers from leaf to leaf.

A B*-tree is an even better form of B-tree. It has these characteristics:

  • Every node except the root has at most m children.

  • Every node, except for the root and the leaves, has at least ((2m-1)/3) children.

  • The root has at least two and at most (2[(2m-2)/3]+1) children.

  • All leaves appear on the same level.

  • A nonleaf node with k children contains (k-1) keys.

The important characteristic is the second onethe fact that at least two- thirds of the available space in every node is always used. This uses space efficiently and makes searches faster. On the downside, it slows down INSERT because nodes tend to need more attention as they fill up. B*-trees are B-trees with one difference: A split is not a 50/50 fission. Instead the DBMS will attempt to shift keys into adjoining leaf blocks so that pages are fuller and the need to make new pages is less frequent. This makes splits affect multiple pages. B*-tree use is very rare.

Note: We will follow common usage and use "B-tree" to mean the "B+-tree" variant of the B-tree, as most DBMS vendors do.

In a B-tree, records are stored in locations called leaves. The name derives from the fact that records always exist at endpointsthere is nothing beyond them. The maximum number of children per node is known as the order of the tree. A practical B-tree can contain thousands, millions, or billions of records.

The four characteristics that describe a B-tree are: sortable, selective, subset, balanced.

  • Sortable because you travel through a B-tree's keys in order, as when you use >= or BETWEEN in a WHERE clause, or when you use ORDER BY to sort a result set. A BLOB, which has no inherent sortability, is not appropriate for a B-tree index.

  • Selective because you want to get only some of the rows. Selectivity is defined as:

    graphics/09equ01.gif


    and is often expressed as a percentage. Thus, for a sex column in a 100-row table, selectivity is 2%there are only two possible distinct values and ( 2/100 ) is 2%. The usual recommendation is not to make a B-tree if selectivity is a low value, so sex is an example of an inappropriate choice.

  • Subset because the key value (the part of an index key other than the pointer) is completely derived from the column value in the data row. Thus, if a row is {'Abc', 15} , then a full index also has {'Abc', 15} though possibly truncated and not necessarily in that order.

  • Balanced because the distance from the root to the leaf is always the same. Figure 9-2 shows an example of an unbalanced tree.

    Figure 9-2. An unbalanced tree

    graphics/09fig02.gif

Confusingly, it's traditional to diagram a tree upside downthe root is sometimes called the "top layer" and the leaf is the bottom. The interesting thing to note about the tree shown in Figure 9-2 is that a direct pointer leads from the root to leaf page #1, but an intermediate node is between the root and leaf page #2, which is what makes the tree unbalanced. Therefore, eponymously, Figure 9-2 is not a B-treeB-trees are always balanced.

In a Btree, all the keys are grouped inside fixed- size pages. The keys in the root page point to intermediate pages, the keys in the intermediate pages point to leaf pages, and the keys in the leaf pages point to data rows. That, at least, is what happens when three levels (or three layers ) are in the index, and the index is not clustered (the "clustered index" exception is something we'll talk about later). Figure 9-3 shows a three-level B-tree.

Figure 9-3. Three-level B-tree showing root, intermediate, and leaf pages

graphics/09fig03.gif

Searching a B-tree

Suppose you have the table and index shown in Figure 9-4 (we'll call this our example database) and execute this SQL statement:

 SELECT * FROM Table1   WHERE column1 = 'OTTO' 
Figure 9-4. Example database with B-tree index

graphics/09fig04.gif

To evaluate this query, the DBMS starts by scanning the index's root page. It scans until it's pointing to a key value that is greater than or equal to OTTO (in this case, SAM ). Now OTTO , if it exists, is on the path that is pointed to by the key just before that (in this case, JOE ), so the DBMS loads leaf page #2. Then it scans leaf page #2, looking again for the first value that is greater than or equal to OTTO (in this case, OTTO ). This second scan tells the DBMS to load data page #4. Once it has done so, it finds OTTO in the second row in that page.

Two things are worth noting about this system:

  • The DBMS is scanning keys that are known to be in order, so it can do a binary searchnot a sequential searchof the keys in the page. (A binary search is possible even if there are variable-size keys because the page header contains offset pointers.) Thus it makes no difference whether keys are at the start or at the end of the pagesearch time is the same.

  • Your search didn't specify how many rows you wanted. Therefore, unless the DBMS somehow knows that OTTO can only occur once, it will continue scanning. In this case, it will follow the leaf pointer and get the next leaf blocka wasted disk access, because the next leaf page actually begins with SAM .

The notes here are about exceptions and details. The fundamental point is much simpler: The DBMS needed two page reads to find the key. This figure is fixedthe number of index levels equals the number of reads, always, except that you can be fairly confident that the root page is in a cache. (There is actually one exception: If the DBMS is scanning leaf pages sequentially, there is no need to visit intermediate or root pages so the reads are fewer.)

For example, suppose you execute this query:

 SELECT column1 FROM Table1   ORDER BY column1 

and the DBMS decides to resolve the query by going through all the keys in the index, sequentially. In that case, with most DBMSs, the transit is from leaf page to leaf page directly and not via the upper (i.e., intermediate or root) index levels. This is because there are pointers from one leaf page forward to the next leaf or from one leaf page back to the previous leaf.

The critical thing, then, is the number of layersnot the total index sizeand that's why it's good to have big pages. Typically an index page size is 8KB. It's this mandateKeep the number of layers small!that influences several of the design and updating decisions that programmers must make. A rule of thumb is that the number of layers should never exceed five. If there are more than five layers, it's time for drastic measures like partitioning. (This rule is somewhat arbitrary, as all "rules of thumb" arewe're just passing on the general consensus from many diverse sources.)

Inserting into a B-tree

When you INSERT a new row, or UPDATE with a new value for an indexed column, the result includes the insertion of a new key value into the index. Usually the insert of a new key value is simple:

  • The DBMS scans the index pages until it finds a key value that is greater than the new key value.

  • Then it shifts forward all following data in the page to make room for the new key value.

  • Then it copies the new value into the page. (The shifting should mean that it's slightly slower to insert a key at the start of a page than at the endwhich means that it's bad to INSERT in a descending sequence. However, the difference is barely measurable.)

What if a key value insert isn't simple? For example, what happens with our example database (Figure 9-4) if the following SQL statements are executed?

 INSERT INTO Table1 VALUES ('MARY') INSERT INTO Table1 VALUES ('OMAR') 

The page doesn't contain enough room on which to fit these new key values. To make them fit, the DBMS must take some keys out of the current page and put them in a new pagefor example, at the end of the index file. [1] This process is called splitting. The rules for splitting are as follows :

[1] This, of course, is a very simplistic description of what could actually happen. As with data pages, allocation of new index pages is usually more complex; involving the DBMS's freelist or other storage hierarchy rules.

  • At the leaf level, splits should only occur when there is no chance of fitting the new key in the current page.

  • Splits should be 50/50. That is, after the split happens and the new key is added, the total number of bytes in the keys of the current page should be about the same as in the new page. There are exceptions to this rule: (a) when the new key would otherwise go at the end of the page, the current page is left alone and only the new key goes into the new page and (b) the definition of a "full" page depends on switches like PCTFREE/FILLFACTOR.

  • A pointer to the new leaf page must be added into the appropriate page at the intermediate level.

  • For reasons that have to do with locking, a split must not cascade. That is, a split on the leaf must not cause a split in the intermediate level. To ensure this, the DBMS will split a nearly full intermediate-level page first, before it reads the leaf page. This means that unnecessary splits are sometimes made at high levels.

  • If variable-length fields are present, the calculations become more complicated because the split cannot occur right in the middle of the page, and because the new key might be so big that a single split will not free enough room.

Figure 9-5 shows what the example database looks like after the INSERTs.

Figure 9-5. Example database after INSERTs [*]

graphics/09fig05.gif

[*] Figure 9-5 shows the same database as in Figure 9-4, after two INSERTs and a split. Notice that not all leaf pages are full, and notice that leaf page #4 should logically follow leaf page #2.

The effect of INSERT is much like the effect of dripping grains of sand, one grain at a time, onto the top of a pile. For a long time nothing happens, but eventually the addition of one more grain causes an avalanche, the shape of the pile changes, and a new equilibrium is reached.

Because of splits, the speed of data-change statements (INSERT, UPDATE, DELETE) may vary unpredictably. To reduce that unpredictability , you can instruct the DBMS to leave a little bit of space in every page and therefore make splits unlikely . This is done with PCTFREE or FILLFACTOR (see the section "Free Page Space" in Chapter 8, "Tables"). But PCTFREE/FILLFACTOR and their ilk affect only the free space left during index creation or rebuilding. During data-change operations, splitting occurs only if there is absolutely no chance of fitting the new key in the page.

Because of the rules of splits, a page will be less than 100% full after a large number of random key insertions (because of waste space at the end of the page), but more than 50% full (because a split makes pages 50% [2] full and then there's room for some INSERTs that won't cause a split). The rule of thumb to follow here is to assume pages are 70% full if insertions are random, or more than that if insertions are done in an ascending sequence.

[2] This is an approximation . As stated earlier, the definition of a "full" page is influenced by switches like PCTFREE/FILLFACTOR.

Deleting from a B-tree

There are two ways to delete keys from B-trees: the obvious way, and the Microsoft/Oracle way.

If the DBMS treats deletions as merely insertions in reverse, the deletion rules are:

  • Scan the index until the key whose value equals the deleted value is found. This involves comparing both key value and data pointer because sometimes the same value can occur in different rows.

  • Shift all following keys in the page upward over the key being deleted.

  • If the key was the first in the leaf page, update the key value in the intermediate page. Some cascading may occur, and there is an opportunity for a reverse splitthat is, two logically adjacent pages could now be merged if both are less than 50% full. Actually, though, no such thing happens.

Microsoft and Oracle don't do deletions the obvious way. For these DBMSs, the deletion rules are:

  • Scan the index to find the key so far so good.

  • Mark the key as deletedbut don't move anything! The key is still there and is still used for ordering, but it's become a "ghost" row.

The obvious advantage of the Microsoft/Oracle way is that the expense of shifting and cascading is eliminatedfor now. (An additional and perhaps more important reason has to do with a special type of lock called a "key range lock," which we'll discuss in Chapter 15, "Locks.") That's all very well, and the space is reused when a new value will fit in the same place or when a low-priority background thread clears up the page. However, it does mean you shouldn't repeatedly change an indexed column to an ever- newer value. Consider this snippet of code, part of a program to make an index grow indefinitely:

 for (i=0; i<1000000; ++i) {   num=rand();   EXEC SQL UPDATE Table1 SET column1 = :num;   } 

We ran a program containing a loop like this after inserting 10,000 rows into Table1 . When the program finished, the index had at least doubled in size. So although the old tipDo DELETE before INSERTis still good, it isn't as powerful as it once was.

Let's do one more change to our example database (Figure 9-5). We'll DELETE with this SQL statement:

 DELETE FROM Table1    WHERE column1 = 'CARLA' 

Now, assuming the DBMS has deleted the obvious way, the example database won't have any ghost rows but will still look rather messy. Figure 9-6 shows what it looks like now; this is what is known as a fragmented database.

Figure 9-6. Example database after DELETEfragmented

graphics/09fig06.gif

Fragmentation

So far, we've done two INSERTs and one DELETE on the example database, and the result is a horrible mess! Look at everything that's wrong with Figure 9-6:

  • The leaf pages are out of order. At this stage, if the DBMS wants to go through the pages in logical order by key valuea common requirement for a range search, an ORDER BY, or a GROUP BYit must jump from leaf page #2 to page #4 and then back to page #3. The situation won't affect searches for a single row with an equals operator, but that hides the effect only for a while. As we learned in Chapter 8, this problem is called fragmentation.

  • The leaf pages vary in fullness. Leaf pages #1 and #4 are only 50% full. Leaf page #3 is 100% full, and the next insertion will almost certainly cause a split. This problem is called skew .

  • And finally, the root looks ready for a split on the very next insertion, regardless of what gets changed. Remember, splits in upper layers happen "in anticipation" so a split may occur even when the leaf page won't overflow. And a split on the root is an especially bad thing, because after a root split happens, there is one more layerand when there is one more layer, there is one more I/O for every index lookup. And when that happens, everything takes 50% longer. That's unfair in this case, because the leaf pages aren't really full, or even 70% full on average.

Hey, we admit this isn't a real example. In particular, by limiting the page size so the maximum number of rows per page is four, and by making changes in precisely the wrong places, we've made it look very easy to cause a mess. In fact, though, on a real database, the rule of thumb is that you don't have to worry about fragmentation and skewing until the number of rows you've changed or added is equal to more than 5% of the total number of rows in the database. After that, it's probably time to rebuild. Tools are available to check how messy an index looks, but they're vendor-specific so we'll skip them and proceed right to the rebuilding step.

Rebuilding a B-tree

If you've followed us this far, you know that rebuilding will be an inevitable part of database life until DBMSs learn how to rearrange indexes automatically during regular activity. Skew and fragmentation cause a need for cleanup. There are two ways to do it: ALTER INDEX/REBUILD or DROP INDEX/RECREATE.

ALTER INDEX/REBUILD

We're being vague with our syntax here because it varies, but most DBMSs support some type of nonstandard SQL extension that will "rebuild" an index. The important thing is that rebuilding doesn't really happen. What does happen is shown, in pseudocode, in Listing 9-1.

Listing 9-1 Pseudocode of ALTER INDEX/REBUILD
 for (each leaf page)  if (first index key in page could fit in prior leaf page)   if (insert wouldn't overflow past index's free page space)    shift this index key to the prior leaf page    shift following keys up; adjust intermediate-page pointers    repeat for (each intermediate page)  do approximately the same things as for leaf-page loop 

We have left out some important matters in Listing 9-1, such as what to do if a page becomes empty and the index can be shrunk. But the important thing is that the algorithm will fix skew. It won't, however, fix fragmentation. Table 9-1 shows the command each of the Big Eight provides for index rebuilding.

DROP INDEX/RECREATE

Although it's reasonable to guess that it will take 100 times longer than ALTER INDEX/REBUILD, sometimes the only way to fix an index is to drop the index (with the nonstandard SQL-extension statement DROP INDEX) and then remake it with CREATE INDEX. And even that won't be enough unless you also run an operating system defragger program. We can only say that DROP INDEX/RECREATE should be more frequent than it is, but there is no rule of thumb and no consensus about this.

In this context, let's address some general issues about "bulk changes"changing many keys at once. Long ago, someone had the insight that if bulk changes were made in order by key value, the process would go much more quickly because the DBMS would always be going forward to a later page in the index. It would never have to backtrack, and therefore the heads would not thresh back and forth on the disk drive.

This is relevant to CREATE INDEX because, as it happens, all DBMSs have optimized CREATE INDEX so that the DBMS will sort the keys before putting them in the index. That's why you'll often see the tipLoad data first, make index second. Presorting the keys yourself is so much better that you'll even see this tipIf you're going to change more than 5% of the keys, drop the index first and create it again when the changes are over. (Of course, you'll make sure that the relevant table can't be used by another transaction while you're doing this!)

Table 9-1. DBMS ALTER INDEX/REBUILD Support
  Rebuild Command
IBM REBUILD INDEX utility
Informix none: vendor recommends DROP INDEX and re-create
Ingres MODIFY TO MERGE
InterBase ALTER INDEX INACTIVE/ACTIVE
Microsoft DBCC DBREINDEX
MySQL none: vendor recommends OPTIMIZE TABLE
Oracle ALTER INDEX REBUILD
Sybase dbcc reindex

In an ideal world, you would sort the rows you were about to process (e.g., by UPDATE statements), so that the DBMS could update the index in key order, but within a pagedoing deletions before insertions so that room would be guaranteed for the new keys. The latest version of Microsoft's DBMS does such sorting automatically. Finally, most DBMSs will allow options like "disable index" (e.g., MySQL, Oracle) or " defer index key write" (e.g., MySQL). These options are confessions of weakness on the DBMS maker's part, but that is no reason to disdain them.

Tip

Just before a peak period, anticipate by rebuilding your index even if it's not fragmented. Use PCTFREE/FILLFACTOR, so that lots of space will be in each index page when the split times come.


The Bottom Line: B-trees

All DBMSs depend on B-treesthey are the default structures built when you create an index and are also the structures that will be used to enhance retrieval in most cases.

A B-tree is sortable because you travel through a B-tree's keys in order.

A B-tree is selective because you want to get only some of the rows.

A B-tree is a subset because the key value (the part of an index key other than the pointer) is completely derived from the column value in the data row.

A B-tree is balanced because the distance from the root to any leaf is always the same.

When searching a B-tree, the DBMS is scanning keys that are known to be in order, so it can do a binary searchnot a sequential searchof the keys in the page. Thus it makes no difference whether keys are at the start or at the end of the pagesearch time is the same.

The number of B-tree index levels equals the number of index page reads, always, except that you can be fairly confident that the root page is in a cache.

The critical thing is the number of B-tree layersnot the total index sizeand that's why it's good to have big pages. Keep the number of layers small!

A rule of thumb for B-trees is that the number of layers should never exceed five. If there are more than five layers, it's time for drastic measures like partitioning.

The rules for index splitting are:

  • At the leaf level, splits should only occur when there is no chance of fitting the new key in the current page.

  • Splits should be 50/50after the split, the total number of bytes in the keys of the current page should be about the same as in the new page.

  • A pointer to the new leaf page must be added into the appropriate page at the intermediate level.

  • A split must not cascade.

Because of splits, the speed of data-change statements may vary unpredictably. To reduce that unpredictability, you can instruct the DBMS to leave a little bit of space in every page and therefore make splits less likely.

Because of the rules of splits, a page will be less than 100% full after a large number of random key insertions (because of wasted space at the end of the page) but more than 50% full (because a split makes pages 50% full and then there's room for some INSERTs that won't cause a split). The rule of thumb to follow is to assume pages are 70% full if insertions are random, or more than that if insertions are done in an ascending sequence.

Don't repeatedly change an indexed column to an ever-newer value.

You don't have to worry about fragmentation and skewing until the number of rows you've changed or added is equal to more than 5% of the total number of rows in the database. After that, it's probably time to rebuild.

Skew and fragmentation cause a need for cleanup. There are two ways to do it: ALTER INDEX/REBUILD or DROP INDEX/RECREATE (syntax deliberately vague).

If bulk changes are made in order by key value, the process goes much more quickly because the DBMS is always going forward to a later page in the index. Soload data first, make index second.

If you're going to change more than 5% of the keys, drop the index first and create it again when the changes are over.

Just before a peak period, anticipate by rebuilding your index even if it's not fragmented. Use PCTFREE/FILLFACTOR, so that each index page will have lots of space when the split times come.

   


SQL Performance Tuning
SQL Performance Tuning
ISBN: 0201791692
EAN: 2147483647
Year: 2005
Pages: 125

Similar book on Amazon

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net