7.4. Walking a Tree with SQLIn order to check efficiency and performance, I have compared how each model performed with respect to the following two problems:
To ensure that the only variation from test to test was in the model used, my comparisons are all done using the same DBMS, namely Oracle. 7.4.1. Top-Down Walk: The Vandamme QueryIn the Vandamme query, we start with the commander of the French Third Corps, General Vandamme, and want to display in an orderly fashion all units under his command. We don't want a simple list: the structure of the army corps must be clear, as the corps is made of divisions that are themselves made of brigades that are themselves usually composed of two regiments. 7.4.1.1. Adjacency modelWriting the Vandamme query with the adjacency model is fairly easy when using Oracle's connect by operator. All you have to specify is the node you wish to start from (start with) and how each two successive rows returned relate to each other (connect by <a column of the current row> = prior <a column of the previous row>, or connect by <a column of the previous row> = prior <a column of the current row>, depending on whether you are walking down or up the tree). For indentation, Oracle maintains a pseudo-column named level that tells you how many levels away from the starting point you are. I am using this pseudo-column and left-padding the description with as many spaces as the current value of level. My query is: select lpad(description, length(description) + level) description, commander from adjacency_model connect by parent_id = prior id start with commander = 'Général de Division Dominique Vandamme' And the results are: DESCRIPTION COMMANDER ------------------------------- ----------------------------------------------- III Corps Général de Division Dominique Vandamme 8th Infantry Division Général de Division Baron Etienne-Nicolas Lefol 2nd Brigade Général de Brigade Baron Corsin 37th Rgmt de Ligne Colonel Cornebise 1st Brigade Général de Brigade Billard (d.15th) 23rd Rgmt de Ligne Colonel Baron Vernier 15th Rgmt Léger Colonel Brice ... 10th Infantry Division Général de Division Baron Pierre-Joseph Habert 2nd Brigade Général de Brigade Baron Dupeyroux 70th Rgmt de Ligne Colonel Baron Maury 22nd Rgmt de Ligne Colonel Fantin des Odoards 2nd (Swiss) Infantry Rgmt Colonel Stoffel 1st Brigade Général de Brigade Baron Gengoult 88th Rgmt de Ligne Colonel Baillon 34th Rgmt de Ligne Colonel Mouton Division Artillery 18/2nd Foot Artillery Captain Guérin 40 rows selected. Now, what about the other member in the adjacency family, the recursive with statement?[*] With this model, a recursive-factorized statement is defined, which is made of the union (the union all, to be precise) of two select statements:
All that's left is to fool around with functions to nicely indent the description, and here is our final query: with recursive_query(level, id, description, commander) as (select 1 level, id, description, commander from adjacency_model where commander = 'Général de Division Dominique Vandamme' union all select parent.level + 1, child.id, child.description, child.commander from recursive_query parent, adjacency_model child where parent.id = child.parent_id) select char(concat(repeat(' ', level), description), 60) description, commander from recursive_query Of course, you have to be a real fan of the recursive with to be able to state without blushing that the syntax here is natural and obvious. However, it is not too difficult to understand once written; and it's even rather satisfactory, except that the query first returns General Vandamme as expected, but then all the officers directly reporting to him, and then all the officers reporting to the first one at the previous level, followed by all officers reporting to the second one at the previous level, and so on. The result is not quite the nice top-to-bottom walk of the connect by, showing exactly who reports to whom. I'll hasten to say that since ordering doesn't belong to the relational theory, there is nothing wrong with the ordering that you get from with, but that ordering does raise an important question: in practice, how can we order the rows from a hierarchical query? Ordering the rows from a hierarchical query using recursive with is indeed possible if, for instance, we make the not unreasonable assumption that one parent node never has more than 99 children and that the tree is not monstrously deep. Given these caveats, what we can do is associate with each node a number that indicates where it is located in the hierarchysay 1.030801--to mean the first child (the two rightmost digits) of the eighth child (next two digits, from right to left) of the third child of the root node. This assumes, of course, that we are able to order siblings, and we may not always be able to assign any natural ordering to them. Sometimes it is necessary to arbitrarily assign an order to each sibling using, possibly, an OLAP function such as row_number( ) . We can therefore slightly modify our previous query to arbitrarily assign an order to siblings and to use the just-described technique for ordering the result rows: with recursive_query(level, id, rank, description, commander) as (select 1, id, cast(1 as double), description, commander from adjacency_model where commander = 'Général de Division Dominique Vandamme' union all select parent.level + 1, child.id, parent.rank + ranking.sn / power(100.0, parent.level), child.description, child.commander from recursive_query parent, (select id, row_number( ) over (partition by parent_id order by description) sn from adjacency_model) ranking, adjacency_model child where parent.id =child.parent_id and child.id = ranking.id) select char(concat(repeat(' ', level), description), 60) description, commander from recursive_query order by rank We might fear that the ranking query that appears as a recursive component of the query would be executed for each node in the tree that we visit, returning the same result set each time. This isn't the case. Fortunately, the optimizer is smart enough not to execute the ranking query more than is necessary, and we get: DESCRIPTION COMMANDER ----------------------------- ---------------------------------------------- III Corps Général de Division Dominique Vandamme 10th Infantry Division Général de Division Baron Pierre-Joseph Habert 1st Brigade Général de Brigade Baron Gengoult 34th Rgmt de Ligne Colonel Mouton 88th Rgmt de Ligne Colonel Baillon 2nd Brigade Général de Brigade Baron Dupeyroux 22nd Rgmt de Ligne Colonel Fantin des Odoards 2nd (Swiss) Infantry Rgmt Colonel Stoffel 70th Rgmt de Ligne Colonel Baron Maury Division Artillery 18/2nd Foot Artillery Captain Guérin 11th Infantry Division Général de Division Baron Pierre Berthézène ... 23rd Rgmt de Ligne Colonel Baron Vernier 2nd Brigade Général de Brigade Baron Corsin 37th Rgmt de Ligne Colonel Cornebise Division Artillery 7/6th Foot Artillery Captain Chauveau Reserve Artillery Général de Division Baron Jérôme Doguereau 1/2nd Foot Artillery Captain Vollée 2/2nd Rgmt du Génie The result is not strictly identical to the connect by case, simply because we have ordered siblings by alphabetical order on the description column, while we didn't order siblings at all with connect by (we could have ordered them by adding a special clause). But otherwise, the very same hierarchy is displayed. While the result of the with query is logically equivalent to that of the connect by query, the with query is a splendid example of nightmarish, obfuscated SQL, which in comparison makes the five-line connect by query look like a model of elegant simplicity. And even if on this particular example performance is more than acceptable, one can but wonder with some anguish at what it might be on very large tables. Must we disregard the recursive with as a poor, substandard implementation of the superior connect by? Let's postpone conclusions until the end of this chapter. The ranking number we built in the recursive query is nothing more than a numerical representation of the materialized path. It is therefore time to check how we can display the troops under the command of General Vandamme using a simple materialized path implementation. 7.4.1.2. Materialized path modelOur query is hardly more difficult to write under the materialized path model but for the level, which is derived from the path itself. Let's assume just for an instant that we have at hand a function named mp_depth( ) that returns the number of hierarchical levels between the current node and the top of the tree. We can write a query as: select lpad(a.description, length(a.description) + mp_depth(...)) description, a.commander from materialized_path_model a, materialized_path_model b where a.materialized_path like b.materialized_path || '%' and b.commander = 'Général de Division Dominique Vandamme') order by a.materialized_path Before dealing with the mp_depth( ) function, I'll note a few traps. In my example, I have chosen to start the materialized path with A for the Anglo-Dutch army, P for the Prussian one, and F for the French one. That first letter is then followed by dot-separated digits. Thus, the 12th Dutch line battalion, under the command of Colonel Bagelaar, is A.1.4.2.3, while the 11th Régiment of Cuirassiers of Colonel Courtier is F.9.1.2.2. Ordering by materialized path can lead to the usual problems of alphabetical sorts of strings of digits, namely that 10.2 will be returned before 2.3; however, I should stress that, since the separator has a lower code (in ASCII at least) than 0, then the order of levels will be respected. The sort may not, however respect the order of siblings implied by the path. Does that matter? I don't believe that it does because sibling order is usually information that can be derived from something other than the materialized path itself (for instance, brothers and sisters can be ordered by their birth dates, rather than by the path). Be careful with the approach to sorting that I've used here. The character encoding used by your database might throw off the results. What about our mysterious mp_depth( ) function now? The hierarchical difference between any commander under General Vandamme and General Vandamme himself can be defined as the difference between the absolute levels (i.e., counting down from the root of the tree) of the unit commanded by General Vandamme and any of the underlying units. How then can we determine the absolute level? Well, by counting the dots. To count the dots, the easiest thing to do is probably to start with suppressing them, with the help of the replace( ) function that you find in the SQL dialect of all major products. All you have to do next is subtract the length of the string without the dots from the length of the string with the dots, and you get exactly what you want, the dot-count: length((materialized_path) - length(replace(materialized_path, '.', '')) If we check the result of our dot-counting algorithm for the author of the epigraph that adorns Chapter 6 (a cavalry colonel at the time), here is what we get: SQL> select materialized_path, 2 length(materialized_path) len_w_dots, 3 length(replace(materialized_path, '.', '')) len_wo_dots, 4 length(materialized_path) - 5 length(replace(materialized_path, '.', '')) depth, 6 commander 7 from materialized_path_model 8 where commander = 'Colonel de Marbot' 9 / MATERIALIZED_PATH LEN_W_DOTS LEN_WO_DOTS DEPTH COMMANDER ----------------- ---------- ----------- ---------- ------------------ F.1.5.1.1 9 5 4 Colonel de Marbot Et voilà. 7.4.1.3. Nested sets modelFinding all the units under the command of General Vandamme is very easy under the nested sets model, since the model requires us to have numbered our nodes in such a way that the left_num and right_num of a node bracket are the left_num and right_num of all descendants. All we have to write is: select a.description, a.commander from nested_sets_model a, nested_sets_model b where a.left_num between b.left_num and b.right_num and b.commander = 'Général de Division Dominique Vandamme' All? Not quite. We have no indentation here. How do we get the level? Unfortunately, the only way we have to get the depth of a node (from which indentation is derived) is by counting how many nodes we have between that node and the root. There is no way to derive depth from left_num and right_num (in contrast to the materialized path model). If we want to display an indented list under the nested sets model, then we must join a third time with our nested_sets_model table, for the sole purpose of computing the depth: select lpad(description, length(description) + depth) description, commander from (select count(c.left_num) depth, a.description, a.commander, a.left_num from nested_sets_model a, nested_sets_model b, nested_sets_model c where a.left_num between c.left_num and c.right_num and c.left_num between b.left_num and b.right_num and b.commander = 'Général de Division Dominique Vandamme' group by a.description, a.commander, a.left_num) order by left_num The simple addition of the indentation requirement makes the query, as with (sic) the recursive with( ), somewhat illegible. 7.4.1.4. Comparing the Vandamme query under the various modelsAfter having checked that all queries were returning the same 40 rows properly indented, I then ran each of the queries 5,000 times in a loop (thus returning a total of 200,000 rows). I have compared the number of rows returned per second, taking the adjacency model as our 100-mark reference. You see the results in Figure 7-2. Figure 7-2. Performance comparison for the Vandamme queryAs Figure 7-2 shows, for the Vandamme query, the adjacency model, in which the tree is walked using connect by, outperforms the competition despite the procedural nature of connect by. The materialized path makes a decent show, but probably suffers from the function calls to compute the depth and therefore the indentation. The cost of a nicely indented output is even more apparent with the nested sets model, where the obvious performance killer is the computation of the depth through an additional join and a group by. One might cynically suggest that, since this model is totally hard-wired, static, and non-relational, we might as well go whole hog in ignoring relational design tenets and store the depth of each node relative to the root. Doing so would certainly improve our query's performance, but at a horrendous cost in terms of maintenance. 7.4.2. Bottom-Up Walk: The Highlanders QueryAs I said earlier, looking for the Highland string within the description attributes will necessarily lead to a full scan of the table. But let's write our query with each of the models in turn, and then we'll consider the resulting performance implications. 7.4.2.1. Adjacency modelThe Highlanders query is very straightforward to write using connect by, and once again we use the dynamically computed level pseudo-column to indent our result properly. Note that level was previously giving the depth, and now it returns the height since it is always computed from our starting point, and that we now return the parent after the child: select lpad(description, length(description) + level) description, commander from adjacency_model connect by id = prior parent_id start with description like '%Highland%' And here is the result that we get: DESCRIPTION COMMANDER ---------------------------------- ---------------------------------------- 2/73rd (Highland) Rgmt of Foot Lt-Colonel William George Harris 5th British Brigade Major-General Sir Colin Halkett 3rd Anglo-German Division Lt-General Count Charles von Alten I Corps Prince William of Orange The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of Wellington 1/71st (Highland) Rgmt of Foot Lt-Colonel Thomas Reynell British Light Brigade Major-General Frederick Adam 2nd Anglo-German Division Lt-General Sir Henry Clinton II Corps Lieutenant-General Lord Rowland Hill The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of Wellington 1/79th (Highland) Rgmt of Foot Lt-Colonel Neil Douglas 8th British Brigade Lt-General Sir James Kempt 5th Anglo-German Division Lt-General Sir Thomas Picton (d.18th) General Reserve Duke of Wellington The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of Wellington 1/42nd (Highland) Rgmt of Foot Colonel Sir Robert Macara (d.16th) 9th British Brigade Major-General Sir Denis Pack 5th Anglo-German Division Lt-General Sir Thomas Picton (d.18th) General Reserve Duke of Wellington The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of Wellington 1/92nd (Highland) Rgmt of Foot Lt-Colonel John Cameron 9th British Brigade Major-General Sir Denis Pack 5th Anglo-German Division Lt-General Sir Thomas Picton (d.18th) General Reserve Duke of Wellington The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of Wellington 25 rows selected. The non-relational nature of connect by appears plainly enough: our result is not a relation, since we have duplicates. The name of the Duke of Wellington appears eight times, but in two different capacities, five times (as many times as we have Highland regiments) as commander-in-chief, and three as commander of the General Reserve. Twiceonce as commander of the General Reserve and once as commander-in-chiefwould have been amply sufficient. Can we easily remove the duplicates? No we cannot, at least not easily. If we apply a distinct, the DBMS will sort our result to get rid of the duplicate rows and will break the hierarchical order. We get a result that somehow answers the question. But you can take it or leave it according to the details of your requirements. 7.4.2.2. Materialized path modelThe Highlanders query is slightly more difficult to write under the materialized path model . Identifying the proper rows and indenting them correctly is easy: select lpad(a.description, length(a.description) + mp_depth(b.materialized_path) - mp_depth(a.materialized_path)) description, a.commander from materialized_path_model a, materialized_path_model b where b.materialized_path like a.materialized_path || '%' and b.description like '%Highland%') However, we have two issues to solve:
Paradoxically, the second issue is the reason why we can solve the first one easily; since we shall have to find a means of correctly ordering anyway, adding a distinct will break nothing in this case. How can we order correctly? As usual, by using the materialized path as our sort key. By adding these two elements and pushing the query into the from clause so as to be able to sort by materialized_path without displaying the column, we get: select description, commander from (select distinct lpad(a.description, length(a.description) + mp_depth(b.materialized_path) - mp_depth(a.materialized_path)) description, a.commander, a.materialized_path from materialized_path_model a, materialized_path_model b where b.materialized_path like a.materialized_path || '%' and b.description like '%Highland%') order by materialized_path desc which displays: DESCRIPTION COMMANDER ---------------------------------- ---------------------------------------- 1/92nd (Highland) Rgmt of Foot Lt-Colonel John Cameron 1/42nd (Highland) Rgmt of Foot Colonel Sir Robert Macara (d.16th) 9th British Brigade Major-General Sir Denis Pack 1/79th (Highland) Rgmt of Foot Lt-Colonel Neil Douglas 8th British Brigade Lt-General Sir James Kempt 5th Anglo-German Division Lt-General Sir Thomas Picton (d.18th) General Reserve Duke of Wellington 1/71st (Highland) Rgmt of Foot Lt-Colonel Thomas Reynell British Light Brigade Major-General Frederick Adam 2nd Anglo-German Division Lt-General Sir Henry Clinton II Corps Lieutenant-General Lord Rowland Hill 2/73rd (Highland) Rgmt of Foot Lt-Colonel William George Harris 5th British Brigade Major-General Sir Colin Halkett 3rd Anglo-German Division Lt-General Count Charles von Alten I Corps Prince William of Orange The Anglo-Allied Army of 1815 Field Marshal Arthur Wellesley, Duke of Wellington 16 rows selected. This is a much nicer and more compact result than is achieved with the adjacency model. However, I should point out that a condition such as: where b.materialized_path like a.materialized_path || '%' where we are looking for a row in the table aliased by a, knowing the rows in the table aliased by b, is something that, generally speaking, may be slow because we can't make efficient use of the index on the column. What we would like to do, to make efficient use of the index, is the opposite, looking for b.materialized_path knowing a.materialized_path. There are ways to decompose a materialized path into the list of the materialized paths of the ancestors of the node (see Chapter 11), but that operation is not without cost. On our sample data, the query we have here was giving far better results than decomposing the material path so as to perform a more efficient join with the materialized path of each ancestor. However, this might not be true against several million rows. 7.4.2.3. Nested sets modelOnce again, what hurts this model is that the depth must be dynamically computed, and that computation is a rather heavy operation. Since the Highlanders query is a bottom-up query, we must take care not to display the artificial root node (easily identified by left_num = 1) that we have had to introduce. Moreover, I have had to hard-code the maximum depth (6) to be able to indent properly. In our display, top levels are more indented than bottom levels, which means that padding is inversely proportional to depth. Since the depth is difficult to get, defining the indentation as 6 - depth was the simplest way to achieve the required result. As with the materialized path model, we have to reorder anyway, so we have no scruple about applying a distinct to get rid of duplicate rows. Here's the query: select lpad(description, length(description) + 6 - depth) description, commander from (select distinct b.description, b.commander, b.left_num, (select count(c.left_num) from nested_sets_model c where b.left_num between c.left_num and c.right_num) depth from nested_sets_model a, nested_sets_model b where a.description like '%Highland%' and a.left_num between b.left_num and b.right_num and b.left_num > 1) order by left_num desc This query displays exactly the same result as does the materialized path query in the preceding section. 7.4.2.4. Comparing the various models for the Highlanders queryI have applied the same test to the Highlanders query as to the Vandamme query earlier, running each of the queries 5,000 times, with a minor twist: the adjacency model, as we have seen, returns duplicate rows that we cannot get rid of. My test returns 5,000 times 25 rows for the adjacency model, and 5,000 times 16 rows with the other models, because they are the only rows of interest. If we measure performance as a simple number of rows returned by unit of time, with the adjacency model we are also counting many rows that we are not interested in. I have therefore added an adjusted adjacency model, for which performance is measured as the number of rows of interestthe rows returned by the other two modelsper unit of time. The result is given in Figure 7-3. It is quite obvious from Figure 7-3 that the adjacency model outperforms the two other models by a very wide margin before adjustment, and still by a very comfortable margin after adjustment. Also notice that the materialized path model is still faster than the nested sets model, but only marginally so. Figure 7-3. Performance comparison for the Highlanders queryWe therefore see that, in spite of its procedural nature, the implementation of the connect by works rather well, both for top-down and bottom-up queries, provided of course that columns are suitably indexed. However, the return of duplicate rows in bottom-up queries when there are several starting points can prove to be a practical nuisance. When connect by or a recursive with is not available, the materialized path model makes a good substitute. It is interesting to see that it performs better than the totally hard-wired nested sets model. When designing tables to store hierarchical data, there are a number of mistakes to avoid, some of which are made in our example:
Walking trees, whether down from the root or up from a leaf node, is by nature a sequential and therefore slow operation. |