Section 7.1. Tree Structures

7.1. Tree Structures

Relational theory struck the final blow to hierarchical databases as the main repositories for structured data. Hierarchical databases were historically the first attempt at structuring data that had so far been stored as records in files. Instead of having linear sequences of identical records, various records were logically nested. Hierarchical databases were excellent for some queries, but their strong structure made one feel as if in a straitjacket, and navigating them was painful. They first bore the brunt of the assault by network, or CODASYL, databases, in which navigation was still difficult but that were more flexible, until the relational theory proved that database design was a science and not a craft. However, hierarchies, or at least hierarchical representations, are extremely commonwhich probably accounts for the resilience of the hierarchical model, still alive today under various names such as Lightweight Directory Access Protocol (LDAP) and XML .

The handling of hierarchical data, also widely known as the Bill of Materials (BOM) problem, is not the simplest of problems to understand. Hierarchies are complicated not so much because of the representation of relationships between different components, but mostly because of the way you walk a tree. Walking a tree simply means visiting all or some of the nodes and usually returning them in a given order. Walking a tree is often implemented, when implemented at all, by DBMS engines in a procedural wayand that procedurality is a cardinal relational sin.

7.1.1. Tree Structures Versus Master/Detail Relationships

Many designers tend, not unnaturally, to consider that a parent/child link is in itself not very different from a master/detail relationshipthe classical orders/order_detail relationship, in which the order_detail table stores (as part of its own key) the reference of the order it relates to. There are, however, at least four major differences between the parent/child link and the master/detail relationship:

Single table: The first difference is that when we have a tree representing a hierarchy , all the nodes are of the very same nature. The leaf nodes , in other words the nodes that have no child node, are sometimes different, as happens in file management systems with foldersregular nodes and filesleaf nodes, but I'll set that case apart for the time being. Since all nodes are of the same nature, we describe them in the same way, and they will be represented by rows in the same table. Putting it another way, we have a kind of master/detail relationship, not between two different tables holding rows of different nature, but between a table and itself.
Depth: The second difference is that in the case of a hierarchy, how far you are from the top is often significant information. In a master/detail relationship, you are always either the master or the detail.
Ownership: The third difference is that in a master/detail relationship you can have a clean foreign key integrity constraint; for instance, every order identifier in the order_detail table must correspond to an existing identifier in the orders table, plain and simple. Such is not the case with hierarchical data. You can decide to say that, for instance, the manager number must refer to an existing employee number. Except that you then have a problem with the top manager, who in truth reports to the representatives of shareholdersthe board, not an employee. This leaves us with that endless source of difficulties, a null value. And you may have several such "special case" rows, since we may need to describe in the same table several independent trees, each with its own rootsomething that is called a forest.
Multiple parents: Associating a "child" with the identifier of a "parent" assumes that a child can have only one parent. In fact, there are many real-life situations when this is not the case, whether it is investments, ingredients in formulae, or screws in mechanical parts. A case when a child has multiple parents is arguably not a tree in the mathematical sense; unfortunately, many real-life trees, including genealogical trees, are more complex than simple parent-child relationships, and may even require the handling of special cases (outside the scope of this book) such as cycles in a line of links.

In his excellent book, Practical Issues in Database Management (Addison Wesley), Fabian Pascal explains that the proper relational view of a tree is to understand that we have two distinct entity types, the nodes (for which we may have a special subtype of leaf nodes, bearing more information) and the links between the nodes. I should point out that this design approach solves the question of integrity constraints, since one only describes links that actually exist. Pascal's approach also solves the case of the "child" that appears in the descent of numerous "parents." This case is quite common in the industry and yet so rare in textbooks, which usually stick to the employee/manager example.

Pascal, following ideas of Chris Date, suggests that there should be an explode( ) operator to flatten, on the fly, a hierarchy, by providing a view which would make explicit the implicit links between nodes. The only snag is that this operator has never been implemented. DBMS vendors have quite often implemented specialized processes such as the handling of spatial data or full-text indexing, but the proper implementation of hierarchical data has oscillated between the nonexistent and the feeble, thus leaving most of the burden of implementation just where it doesn't belong: with the developer.

As I have already hinted, the main difficulty when dealing with hierarchical data lies in walking the tree. Of course, if your aim is just to display a tree structure in a graphical user interface, each time the user clicks on a node to expand it, you have no particular problem: issuing a query that returns all the children of the node for which you pass the identifier as argument is a straightforward task.

7.1.2. Practical Examples of Hierarchies

In real life, you meet hierarchies very often, but the tasks applied to them are rarely simple. Here are just three examples of real-life problems involving hierarchies, from different industries:

Risk exposure: When you attempt to compute your exposure to risk in a financial structure such as a hedge fund, the matter becomes hierarchically complex. These financial structures invest in funds that themselves may hold shares in other funds.
Archive location: If you are a big retail bank, you are likely to face a nontrivial task if you want to retrieve from your archives the file of a loan signed by John Doe seven years ago, because files are stored in folders, which are in boxes, which are on shelves, which are in cabinets in an alley in some room of some floor of some building. The nested "containers" (folders, boxes, shelves, etc.) form a hierarchy.
Use of ingredients: If you work for the pharmaceutical industry, identifying all of the drugs you manufacture that contain an ingredient for which a much cheaper equivalent has just been approved and can now be used presents the very same type of SQL challenge in a totally unrelated area.

It is important to understand that these hierarchical problems are indeed quite distinct in their fundamental characteristics. A task such as finding the location of a file in an archive means walking a tree from the bottom to the top (that is, from a position of high granularity to one of increasing aggregation), because you start from some single file reference, that will point you to the folder in which it is stored, where you will find the identification of a box, and so forth on up to the room in a building, and so on, thus determining the exact location of the file. Finding all the products that contain a given ingredient also happens to be a bottom-up walk, although in that case our number of starting points may be very highand we have to repeat the walk each time. By contrast, risk exposure analysis means, first, a top-down walk to find all investments, followed by computations on the way back up to the top. It is a kind of aggregation, only more complicated.

In general, the number of levels in trees tends to be rather small. This is, in fact, the main beauty of trees and the reason why they can be efficiently searched. If the number of levels is fixed, the only thing we have to do is to join the table containing a tree with itself as many times as we have levels. Let's take the case of archives and say that the inventory table shows us in which folder our loan file is located. This folder identifier will take us to a location table, that points us the identifier of the box which contains the folder, the shelf upon which the box is laid, the cabinet to which the shelf belongs, the alley where we can find this cabinet, the room which contains the alley, the floor on which the room is located, and, finally, the building. If the location table treats folders, boxes, shelves, and the like as generic "locations," a query returning all the components in the physical location of a file might look like this:

     select building.name building,            floor.name floor,            room.name room,            alley.name alley,            cabinet.name cabinet,            shelf.name shelf,            box.name box,            folder.name folder     from inventory,          location folder,          location box,          location shelf,          location cabinet,          location alley,          location room,          location floor,          location building     where inventory.id = 'AZE087564609'       and inventory.folder = folder.id       and folder.located_in = box.id       and box.located_in = shelf.id       and shelf.located_in = cabinet.id       and cabinet.located_in = alley.id       and alley.located_in = room.id       and room.located_in = floor.id       and floor.located_in = building.id

This type of query, in spite of an impressive number of joins, should run fast since each successive join will use the unique index on location (that is, the index on id), presumably the primary key. But yes, there is a catch: the number of levels in a hierarchy is rarely constant. Even in the rather sedate world of archives, the contents of boxes are often moved after the passage of time to new containers (which may be more compact and, therefore, provide cheaper storage). Such activity may well replace two levels in a hierarchy with just one, as containers will replace both boxes and shelves. What should we do when we don't know the number of levels? How best do we query such a hierarchy? Do we use a union? An outer-join?

Links between objects of the same nature should be modeled as trees as soon as the number of levels between two objects is no longer a constant.