Section 7.2. Representing Trees in an SQL Database

7.2. Representing Trees in an SQL Database

Trees are generally represented in the SQL world by one of three models:

Adjacency model: The adjacency model is thus called because the identifier of the closest ancestor up in the hierarchy (the parent row) is given as an attribute of the child row. Two adjacent nodes in the tree are therefore clearly associated. The adjacency model is often illustrated by the employee number of the manager being specified as an attribute of each employee managed. (The direct association of the manager to the employee is in truth a poor design, because the manager identification should be an attribute of the structure that is managed. There is no reason that, when the head of a department is changed, one should update the records of all the people who work in the department to indicate the new manager). Some products implement special operators for dealing with this type of model, such as Oracle's connect by (introduced as early as Oracle version 4 around the mid 1980s) or the more recent recursive with statement of DB2 and SQL Server. Without any such operator, the adjacency model is very hard to manage.
Materialized path model: The idea here is to associate with each node in the tree a representation of its position within the tree. This representation takes the form of a concatenated list of the identifiers of all the node's ancestors, from the root of the tree down to its immediate parent, or as a list of numbers indicating the rank within siblings of a given ancestor at one generation (a method frequently used by genealogists). These lists are usually stored as delimited strings. For instance, '1.2.3.2' means (right to left) that the node is the second child of its parent (the path of which is '1.2.3'), which itself is the third child of the grandparent ('1.2'), and so forth.
Nested set model: In this model, devised by Joe Celko,^[*] a pair of numbers (defined as a left number and a right number) is associated to each node in such a fashion that they define an interval which always contains the interval associated with any of the descendents. The upcoming subsection "Nested Sets Model (After Celko)" under "Practical Implementation of Trees" gives a practical example of this intricate scheme.
^[*] First introduced in articles in DBMS Magazine (circa 1996), and much later developed in Trees and Hierarchies in SQL for Smarties (Morgan-Kauffman).

There is a fourth, less well-known model, presented by its author, Vadim Tropashko, who calls it the nested interval model, in a very interesting series of papers.^[*] The idea behind this model is, to put it very simply, to encode the path of a given node with two numbers, which are interpreted as the numerator and the denominator of a rational number (a fraction to those uncomfortable with the vocabulary of mathematics) instead of an interval. Unfortunately, this model is heavy on computations and stored procedures and, while it looks promising for a future implementation of good tree-handling functions (perhaps the explode( ) operator) in a DBMS, it is in practice somewhat difficult to implement and not the fastest you can do, which is why I shall focus on the three aforementioned models.

^[*] Initially published on http://www.dbazine.com.

To keep in tone with our general theme, and to generate a reasonable amount of data, I have created a test database of the organizations of the various armies that were opposed in 1815 at the famous battle of Waterloo in Belgium, near Brussels^[] (known as ^[] Using, with his permission, the data compiled by Peter Kessler, at http://www.kessler-web.co.uk.

I must hasten to say that the point of what follows in this chapter is to demonstrate various ways to walk hierarchies and that the design of my tables is, to say the least, pretty slack. Typically, a proper primary key for a fighting unit should be an understandable and standardized code, not a description that may suffer from data entry errors. Please understand that any reference to a surrogate id is indeed shorthand for an implicit, sound primary key.

The main difficulty with hierarchies is that there is no "best representation." When our interest is mostly confined to the ancestors of a few elements (a bottom-up walk), either connect by or the recursive with is, at least functionally and in terms of performance, sufficiently satisfactory. However, if we scratch the surface, connect by in particular is of course a somewhat ugly, non-relational, procedural implementation, in the sense that we can only move gradually from one row to the next one. It is much less satisfactory when we want to return either a bottom-up hierarchy for a very large number of items, or when we need to return a very large number of descendants in a top-down walk. As is so often the case with SQL, the ugliness that you can hide with a 14-row table becomes painfully obvious when you are dealing with millions, not to say billions, of rows, and that nice little SQL trick now shows its limits in terms of performance.

My example table, which contains a little more than 800 rows, is a bit larger than the usual examples, although it is in no way comparable to what you can regularly find in the industry. However, it is big enough to point out the strengths and weaknesses of the various models.

The SQL implementation of trees is DBMS dependent; use what your DBMS has to offer.