Section 7.4. Walking a Tree with SQL

7.4. Walking a Tree with SQL

In order to check efficiency and performance, I have compared how each model performed with respect to the following two problems:

To find all the units under the command of the French general Dominique Vandamme (a top-down query), if possible as an indented report (which requires keeping track of the depth within the tree) or as a simple list. Note that in all cases we have an index on the commander's name. I refer to this problem as the Vandamme query.
To find, for all regiments of Scottish Highlanders, the various units they belong to, once again with and without proper indentation (a bottom-up query). We have no index on the names of units (column description in the tables), and our only way to spot Scottish Highlanders is to look for the Highland string in the name of the unit, which of course means a full scan in the absence of any full-text indexing. I refer to this problem as the Highlanders query.

To ensure that the only variation from test to test was in the model used, my comparisons are all done using the same DBMS, namely Oracle.

7.4.1. Top-Down Walk: The Vandamme Query

In the Vandamme query, we start with the commander of the French Third Corps, General Vandamme, and want to display in an orderly fashion all units under his command. We don't want a simple list: the structure of the army corps must be clear, as the corps is made of divisions that are themselves made of brigades that are themselves usually composed of two regiments.

7.4.1.1. Adjacency model

Writing the Vandamme query with the adjacency model is fairly easy when using Oracle's connect by operator. All you have to specify is the node you wish to start from (start with) and how each two successive rows returned relate to each other (connect by <a column of the current row> = prior <a column of the previous row>, or connect by <a column of the previous row> = prior <a column of the current row>, depending on whether you are walking down or up the tree). For indentation, Oracle maintains a pseudo-column named level that tells you how many levels away from the starting point you are. I am using this pseudo-column and left-padding the description with as many spaces as the current value of level. My query is:

     select lpad(description, length(description) + level) description,            commander     from adjacency_model     connect by parent_id = prior id     start with commander = 'Général de Division Dominique Vandamme'

And the results are:

     DESCRIPTION                     COMMANDER     ------------------------------- -----------------------------------------------      III Corps                      Général de Division Dominique Vandamme       8th Infantry Division         Général de Division Baron Etienne-Nicolas Lefol        2nd Brigade                  Général de Brigade Baron Corsin         37th Rgmt de Ligne          Colonel Cornebise        1st Brigade                  Général de Brigade Billard (d.15th)         23rd Rgmt de Ligne          Colonel Baron Vernier         15th Rgmt Léger             Colonel Brice            ...       10th Infantry Division        Général de Division Baron Pierre-Joseph Habert        2nd Brigade                  Général de Brigade Baron Dupeyroux         70th Rgmt de Ligne          Colonel Baron Maury         22nd Rgmt de Ligne          Colonel Fantin des Odoards         2nd (Swiss) Infantry Rgmt   Colonel Stoffel        1st Brigade                  Général de Brigade Baron Gengoult         88th Rgmt de Ligne          Colonel Baillon         34th Rgmt de Ligne          Colonel Mouton        Division Artillery         18/2nd Foot Artillery       Captain Guérin     40 rows selected.

Now, what about the other member in the adjacency family, the recursive with statement?^[*] With this model, a recursive-factorized statement is defined, which is made of the union (the union all, to be precise) of two select statements:

^[*] Using this time the first product that implemented it, namely DB2.

The select that defines our starting point, which in this particular case is:
```
     select 1 level,            id,            description,            commander     from adjacency_model     where commander = 'Général de Division Dominique Vandamme' 
```
What is this solitary 1 for? It represents, as the alias indicates, the depth in the tree. In contrast to the Oracle connect by implementation, this DB2 implementation has no system pseudo-variable to tell us where we are in the tree. We can compute our level, however, and I'll explain more about that in just a moment.
The select which defines how each child row relates to its parent row, as it is returned by this very same query that we can call, with a touch of originality, recursive_query:
```
     select parent.level + 1,            child.id,            child.description,            child.comander     from recursive_query parent,          adjacency_model child     where parent.id = child.parent_id 
```
Notice in this query that we add 1 to parent.level. Each execution of this query represents a step down the tree. For each step down the tree, we increment our level, thus keeping track of our depth.

All that's left is to fool around with functions to nicely indent the description, and here is our final query:

     with recursive_query(level, id, description, commander)     as (select 1 level,                id,                description,                commander         from adjacency_model         where commander = 'Général de Division Dominique Vandamme'         union all         select parent.level + 1,                child.id,                child.description,                child.commander         from recursive_query parent,              adjacency_model child         where parent.id = child.parent_id)     select char(concat(repeat(' ', level), description), 60) description,            commander     from recursive_query

Of course, you have to be a real fan of the recursive with to be able to state without blushing that the syntax here is natural and obvious. However, it is not too difficult to understand once written; and it's even rather satisfactory, except that the query first returns General Vandamme as expected, but then all the officers directly reporting to him, and then all the officers reporting to the first one at the previous level, followed by all officers reporting to the second one at the previous level, and so on. The result is not quite the nice top-to-bottom walk of the connect by, showing exactly who reports to whom. I'll hasten to say that since ordering doesn't belong to the relational theory, there is nothing wrong with the ordering that you get from with, but that ordering does raise an important question: in practice, how can we order the rows from a hierarchical query?

Ordering the rows from a hierarchical query using recursive with is indeed possible if, for instance, we make the not unreasonable assumption that one parent node never has more than 99 children and that the tree is not monstrously deep. Given these caveats, what we can do is associate with each node a number that indicates where it is located in the hierarchysay 1.030801--to mean the first child (the two rightmost digits) of the eighth child (next two digits, from right to left) of the third child of the root node. This assumes, of course, that we are able to order siblings, and we may not always be able to assign any natural ordering to them. Sometimes it is necessary to arbitrarily assign an order to each sibling using, possibly, an OLAP function such as row_number( ) .

We can therefore slightly modify our previous query to arbitrarily assign an order to siblings and to use the just-described technique for ordering the result rows:

     with recursive_query(level, id, rank, description, commander)     as (select 1,                id,                cast(1 as double),                description,                commander         from adjacency_model         where commander = 'Général de Division Dominique Vandamme'         union all         select parent.level + 1,                child.id,                parent.rank + ranking.sn / power(100.0, parent.level),                child.description,                child.commander         from recursive_query parent,              (select id,                      row_number( ) over (partition by parent_id                                         order by description) sn               from adjacency_model) ranking,              adjacency_model child         where parent.id =child.parent_id           and child.id = ranking.id)     select char(concat(repeat(' ', level), description), 60) description,            commander     from recursive_query     order by rank

We might fear that the ranking query that appears as a recursive component of the query would be executed for each node in the tree that we visit, returning the same result set each time. This isn't the case. Fortunately, the optimizer is smart enough not to execute the ranking query more than is necessary, and we get:

     DESCRIPTION                    COMMANDER     -----------------------------  ----------------------------------------------      III Corps                     Général de Division Dominique Vandamme       10th Infantry Division       Général de Division Baron Pierre-Joseph Habert        1st Brigade                 Général de Brigade Baron Gengoult         34th Rgmt de Ligne         Colonel Mouton         88th Rgmt de Ligne         Colonel Baillon        2nd Brigade                 Général de Brigade Baron Dupeyroux         22nd Rgmt de Ligne         Colonel Fantin des Odoards         2nd (Swiss) Infantry Rgmt  Colonel Stoffel         70th Rgmt de Ligne         Colonel Baron Maury        Division Artillery         18/2nd Foot Artillery      Captain Guérin       11th Infantry Division       Général de Division Baron Pierre Berthézène        ...         23rd Rgmt de Ligne         Colonel Baron Vernier        2nd Brigade                 Général de Brigade Baron Corsin         37th Rgmt de Ligne         Colonel Cornebise        Division Artillery         7/6th Foot Artillery       Captain Chauveau       Reserve Artillery            Général de Division Baron Jérôme Doguereau        1/2nd Foot Artillery        Captain Vollée        2/2nd Rgmt du Génie

The result is not strictly identical to the connect by case, simply because we have ordered siblings by alphabetical order on the description column, while we didn't order siblings at all with connect by (we could have ordered them by adding a special clause). But otherwise, the very same hierarchy is displayed.

While the result of the with query is logically equivalent to that of the connect by query, the with query is a splendid example of nightmarish, obfuscated SQL, which in comparison makes the five-line connect by query look like a model of elegant simplicity. And even if on this particular example performance is more than acceptable, one can but wonder with some anguish at what it might be on very large tables. Must we disregard the recursive with as a poor, substandard implementation of the superior connect by? Let's postpone conclusions until the end of this chapter.

The ranking number we built in the recursive query is nothing more than a numerical representation of the materialized path. It is therefore time to check how we can display the troops under the command of General Vandamme using a simple materialized path implementation.

7.4.1.2. Materialized path model

Our query is hardly more difficult to write under the materialized path model but for the level, which is derived from the path itself. Let's assume just for an instant that we have at hand a function named mp_depth( ) that returns the number of hierarchical levels between the current node and the top of the tree. We can write a query as:

     select lpad(a.description, length(a.description)             + mp_depth(...)) description,            a.commander     from materialized_path_model a,          materialized_path_model b     where a.materialized_path like b.materialized_path || '%'       and b.commander = 'Général de Division Dominique Vandamme')     order by a.materialized_path

Before dealing with the mp_depth( ) function, I'll note a few traps. In my example, I have chosen to start the materialized path with A for the Anglo-Dutch army, P for the Prussian one, and F for the French one. That first letter is then followed by dot-separated digits. Thus, the 12th Dutch line battalion, under the command of Colonel Bagelaar, is A.1.4.2.3, while the 11th Régiment of Cuirassiers of Colonel Courtier is F.9.1.2.2. Ordering by materialized path can lead to the usual problems of alphabetical sorts of strings of digits, namely that 10.2 will be returned before 2.3; however, I should stress that, since the separator has a lower code (in ASCII at least) than 0, then the order of levels will be respected. The sort may not, however respect the order of siblings implied by the path. Does that matter? I don't believe that it does because sibling order is usually information that can be derived from something other than the materialized path itself (for instance, brothers and sisters can be ordered by their birth dates, rather than by the path). Be careful with the approach to sorting that I've used here. The character encoding used by your database might throw off the results.

What about our mysterious mp_depth( ) function now? The hierarchical difference between any commander under General Vandamme and General Vandamme himself can be defined as the difference between the absolute levels (i.e., counting down from the root of the tree) of the unit commanded by General Vandamme and any of the underlying units. How then can we determine the absolute level? Well, by counting the dots.

To count the dots, the easiest thing to do is probably to start with suppressing them, with the help of the replace( ) function that you find in the SQL dialect of all major products. All you have to do next is subtract the length of the string without the dots from the length of the string with the dots, and you get exactly what you want, the dot-count:

     length((materialized_path) - length(replace(materialized_path, '.', ''))

If we check the result of our dot-counting algorithm for the author of the epigraph that adorns Chapter 6 (a cavalry colonel at the time), here is what we get:

     SQL> select materialized_path,       2        length(materialized_path) len_w_dots,       3        length(replace(materialized_path, '.', '')) len_wo_dots,       4        length(materialized_path) -       5             length(replace(materialized_path, '.', '')) depth,       6        commander       7  from materialized_path_model       8  where commander = 'Colonel de Marbot'       9  /     MATERIALIZED_PATH LEN_W_DOTS LEN_WO_DOTS      DEPTH COMMANDER     ----------------- ---------- ----------- ---------- ------------------     F.1.5.1.1                  9           5          4 Colonel de Marbot

Et voilà.

7.4.1.3. Nested sets model

Finding all the units under the command of General Vandamme is very easy under the nested sets model, since the model requires us to have numbered our nodes in such a way that the left_num and right_num of a node bracket are the left_num and right_num of all descendants. All we have to write is:

     select a.description,            a.commander     from nested_sets_model a,          nested_sets_model b     where a.left_num between b.left_num and b.right_num       and b.commander = 'Général de Division Dominique Vandamme'

All? Not quite. We have no indentation here. How do we get the level? Unfortunately, the only way we have to get the depth of a node (from which indentation is derived) is by counting how many nodes we have between that node and the root. There is no way to derive depth from left_num and right_num (in contrast to the materialized path model).

If we want to display an indented list under the nested sets model, then we must join a third time with our nested_sets_model table, for the sole purpose of computing the depth:

     select lpad(description, length(description) + depth) description,            commander     from (select count(c.left_num) depth,                  a.description,                  a.commander,                  a.left_num           from nested_sets_model a,                nested_sets_model b,                nested_sets_model c           where a.left_num between c.left_num and c.right_num             and c.left_num between b.left_num and b.right_num             and b.commander = 'Général de Division Dominique Vandamme'           group by a.description,                    a.commander,                    a.left_num)     order by left_num

The simple addition of the indentation requirement makes the query, as with (sic) the recursive with( ), somewhat illegible.

7.4.1.4. Comparing the Vandamme query under the various models

After having checked that all queries were returning the same 40 rows properly indented, I then ran each of the queries 5,000 times in a loop (thus returning a total of 200,000 rows). I have compared the number of rows returned per second, taking the adjacency model as our 100-mark reference. You see the results in Figure 7-2.

Figure 7-2. Performance comparison for the Vandamme query

As Figure 7-2 shows, for the Vandamme query, the adjacency model, in which the tree is walked using connect by, outperforms the competition despite the procedural nature of connect by. The materialized path makes a decent show, but probably suffers from the function calls to compute the depth and therefore the indentation. The cost of a nicely indented output is even more apparent with the nested sets model, where the obvious performance killer is the computation of the depth through an additional join and a group by. One might cynically suggest that, since this model is totally hard-wired, static, and non-relational, we might as well go whole hog in ignoring relational design tenets and store the depth of each node relative to the root. Doing so would certainly improve our query's performance, but at a horrendous cost in terms of maintenance.

7.4.2. Bottom-Up Walk: The Highlanders Query

As I said earlier, looking for the Highland string within the description attributes will necessarily lead to a full scan of the table. But let's write our query with each of the models in turn, and then we'll consider the resulting performance implications.

7.4.2.1. Adjacency model

The Highlanders query is very straightforward to write using connect by, and once again we use the dynamically computed level pseudo-column to indent our result properly. Note that level was previously giving the depth, and now it returns the height since it is always computed from our starting point, and that we now return the parent after the child:

     select lpad(description, length(description) + level) description,            commander     from adjacency_model     connect by id = prior parent_id     start with description like '%Highland%'

And here is the result that we get:

     DESCRIPTION                        COMMANDER     ---------------------------------- ----------------------------------------      2/73rd (Highland) Rgmt of Foot    Lt-Colonel William George Harris       5th British Brigade              Major-General Sir Colin Halkett        3rd Anglo-German Division       Lt-General Count Charles von Alten         I Corps                        Prince William of Orange          The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of                                        Wellington      1/71st (Highland) Rgmt of Foot    Lt-Colonel Thomas Reynell       British Light Brigade            Major-General Frederick Adam        2nd Anglo-German Division       Lt-General Sir Henry Clinton         II Corps                       Lieutenant-General Lord Rowland Hill          The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of                                        Wellington      1/79th (Highland) Rgmt of Foot    Lt-Colonel Neil Douglas       8th British Brigade              Lt-General Sir James Kempt        5th Anglo-German Division       Lt-General Sir Thomas Picton (d.18th)         General Reserve                Duke of Wellington          The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of                                        Wellington      1/42nd (Highland) Rgmt of Foot    Colonel Sir Robert Macara (d.16th)       9th British Brigade              Major-General Sir Denis Pack        5th Anglo-German Division       Lt-General Sir Thomas Picton (d.18th)         General Reserve                Duke of Wellington          The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of                                        Wellington      1/92nd (Highland) Rgmt of Foot    Lt-Colonel John Cameron       9th British Brigade              Major-General Sir Denis Pack        5th Anglo-German Division       Lt-General Sir Thomas Picton (d.18th)         General Reserve                Duke of Wellington          The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of                                        Wellington     25 rows selected.

The non-relational nature of connect by appears plainly enough: our result is not a relation, since we have duplicates. The name of the Duke of Wellington appears eight times, but in two different capacities, five times (as many times as we have Highland regiments) as commander-in-chief, and three as commander of the General Reserve. Twiceonce as commander of the General Reserve and once as commander-in-chiefwould have been amply sufficient. Can we easily remove the duplicates? No we cannot, at least not easily. If we apply a distinct, the DBMS will sort our result to get rid of the duplicate rows and will break the hierarchical order. We get a result that somehow answers the question. But you can take it or leave it according to the details of your requirements.

7.4.2.2. Materialized path model

The Highlanders query is slightly more difficult to write under the materialized path model . Identifying the proper rows and indenting them correctly is easy:

     select lpad(a.description, length(a.description)                            + mp_depth(b.materialized_path)                              - mp_depth(a.materialized_path)) description,            a.commander     from materialized_path_model a,          materialized_path_model b     where b.materialized_path like a.materialized_path || '%'       and b.description like '%Highland%')

However, we have two issues to solve:

We have duplicates, as with the adjacency model.
The order of rows is not the one we want.

Paradoxically, the second issue is the reason why we can solve the first one easily; since we shall have to find a means of correctly ordering anyway, adding a distinct will break nothing in this case. How can we order correctly? As usual, by using the materialized path as our sort key. By adding these two elements and pushing the query into the from clause so as to be able to sort by materialized_path without displaying the column, we get:

     select description, commander     from (select distinct lpad(a.description, length(a.description)                            + mp_depth(b.materialized_path)                              - mp_depth(a.materialized_path)) description,                            a.commander,                            a.materialized_path           from materialized_path_model a,                materialized_path_model b           where b.materialized_path like a.materialized_path || '%'             and b.description like '%Highland%')     order by materialized_path desc

which displays:

     DESCRIPTION                         COMMANDER     ---------------------------------- ----------------------------------------     1/92nd (Highland) Rgmt of Foot      Lt-Colonel John Cameron     1/42nd (Highland) Rgmt of Foot      Colonel Sir Robert Macara (d.16th)      9th British Brigade                Major-General Sir Denis Pack     1/79th (Highland) Rgmt of Foot      Lt-Colonel Neil Douglas      8th British Brigade                Lt-General Sir James Kempt       5th Anglo-German Division         Lt-General Sir Thomas Picton (d.18th)        General Reserve                  Duke of Wellington     1/71st (Highland) Rgmt of Foot      Lt-Colonel Thomas Reynell      British Light Brigade              Major-General Frederick Adam       2nd Anglo-German Division         Lt-General Sir Henry Clinton        II Corps                         Lieutenant-General Lord Rowland Hill     2/73rd (Highland) Rgmt of Foot      Lt-Colonel William George Harris      5th British Brigade                Major-General Sir Colin Halkett       3rd Anglo-German Division         Lt-General Count Charles von Alten        I Corps                          Prince William of Orange         The Anglo-Allied Army of 1815   Field Marshal Arthur Wellesley, Duke of                                         Wellington     16 rows selected.

This is a much nicer and more compact result than is achieved with the adjacency model. However, I should point out that a condition such as:

     where b.materialized_path like a.materialized_path || '%'

where we are looking for a row in the table aliased by a, knowing the rows in the table aliased by b, is something that, generally speaking, may be slow because we can't make efficient use of the index on the column. What we would like to do, to make efficient use of the index, is the opposite, looking for b.materialized_path knowing a.materialized_path. There are ways to decompose a materialized path into the list of the materialized paths of the ancestors of the node (see Chapter 11), but that operation is not without cost. On our sample data, the query we have here was giving far better results than decomposing the material path so as to perform a more efficient join with the materialized path of each ancestor. However, this might not be true against several million rows.

7.4.2.3. Nested sets model

Once again, what hurts this model is that the depth must be dynamically computed, and that computation is a rather heavy operation. Since the Highlanders query is a bottom-up query, we must take care not to display the artificial root node (easily identified by left_num = 1) that we have had to introduce. Moreover, I have had to hard-code the maximum depth (6) to be able to indent properly. In our display, top levels are more indented than bottom levels, which means that padding is inversely proportional to depth. Since the depth is difficult to get, defining the indentation as 6 - depth was the simplest way to achieve the required result.

As with the materialized path model, we have to reorder anyway, so we have no scruple about applying a distinct to get rid of duplicate rows. Here's the query:

     select lpad(description, length(description) + 6 - depth) description,            commander     from (select distinct b.description,                           b.commander,                           b.left_num,                           (select count(c.left_num)                            from nested_sets_model c                            where b.left_num between c.left_num                                                 and c.right_num) depth           from nested_sets_model a,                nested_sets_model b           where a.description like '%Highland%'             and a.left_num between b.left_num and b.right_num             and b.left_num > 1)     order by left_num desc

This query displays exactly the same result as does the materialized path query in the preceding section.

7.4.2.4. Comparing the various models for the Highlanders query

I have applied the same test to the Highlanders query as to the Vandamme query earlier, running each of the queries 5,000 times, with a minor twist: the adjacency model, as we have seen, returns duplicate rows that we cannot get rid of. My test returns 5,000 times 25 rows for the adjacency model, and 5,000 times 16 rows with the other models, because they are the only rows of interest. If we measure performance as a simple number of rows returned by unit of time, with the adjacency model we are also counting many rows that we are not interested in. I have therefore added an adjusted adjacency model, for which performance is measured as the number of rows of interestthe rows returned by the other two modelsper unit of time. The result is given in Figure 7-3.

It is quite obvious from Figure 7-3 that the adjacency model outperforms the two other models by a very wide margin before adjustment, and still by a very comfortable margin after adjustment. Also notice that the materialized path model is still faster than the nested sets model, but only marginally so.

Figure 7-3. Performance comparison for the Highlanders query

We therefore see that, in spite of its procedural nature, the implementation of the connect by works rather well, both for top-down and bottom-up queries, provided of course that columns are suitably indexed. However, the return of duplicate rows in bottom-up queries when there are several starting points can prove to be a practical nuisance.

When connect by or a recursive with is not available, the materialized path model makes a good substitute. It is interesting to see that it performs better than the totally hard-wired nested sets model.

When designing tables to store hierarchical data, there are a number of mistakes to avoid, some of which are made in our example:

The materialized path should in no way be the key, even if it is unique.: It is true that strong hierarchies are not usually associated with dynamic environments, but you are not defined by your place in a hierarchy.
The materialized path should not imply any ordering of siblings.: Ordering does not belong to a relational model; it is simply concerned with the presentation of data. You must not have to change anything in other rows when you insert a new node or delete an existing one (which is probably the biggest practical reason, forgetting about all theoretical reasons, for not using the nested sets model). It is always easy to insert a node as the parents' last child. You can order everything first by sorting on the materialized path of the parent, and then on whichever attribute looks suitable for ordering the siblings.
The choice of the encoding is not totally neutral.: The choice is not neutral because whether you must sort by the materialized path or by the parent's materialized path, you must use that path as a sort key. The safest approach is probably to use numbers left padded with zeroes, for instance 001.003.004.005 (note that if we always use three positions for each number, the separator can go). You might be afraid of the materialized path's length; but if we assume that each parent never has more than 100 children numbered from 0 to 99, 20 characters allow us to store a materialized path for up to 10 levels, or trees containing up to 100¹⁰ nodesprobably more than needed.

Walking trees, whether down from the root or up from a leaf node, is by nature a sequential and therefore slow operation.