7.1. Tree StructuresRelational theory struck the final blow to hierarchical databases as the main repositories for structured data. Hierarchical databases were historically the first attempt at structuring data that had so far been stored as records in files. Instead of having linear sequences of identical records, various records were logically nested. Hierarchical databases were excellent for some queries, but their strong structure made one feel as if in a straitjacket, and navigating them was painful. They first bore the brunt of the assault by network, or CODASYL, databases, in which navigation was still difficult but that were more flexible, until the relational theory proved that database design was a science and not a craft. However, hierarchies, or at least hierarchical representations, are extremely commonwhich probably accounts for the resilience of the hierarchical model, still alive today under various names such as Lightweight Directory Access Protocol (LDAP) and XML . The handling of hierarchical data, also widely known as the Bill of Materials (BOM) problem, is not the simplest of problems to understand. Hierarchies are complicated not so much because of the representation of relationships between different components, but mostly because of the way you walk a tree. Walking a tree simply means visiting all or some of the nodes and usually returning them in a given order. Walking a tree is often implemented, when implemented at all, by DBMS engines in a procedural wayand that procedurality is a cardinal relational sin. 7.1.1. Tree Structures Versus Master/Detail RelationshipsMany designers tend, not unnaturally, to consider that a parent/child link is in itself not very different from a master/detail relationshipthe classical orders/order_detail relationship, in which the order_detail table stores (as part of its own key) the reference of the order it relates to. There are, however, at least four major differences between the parent/child link and the master/detail relationship:
In his excellent book, Practical Issues in Database Management (Addison Wesley), Fabian Pascal explains that the proper relational view of a tree is to understand that we have two distinct entity types, the nodes (for which we may have a special subtype of leaf nodes, bearing more information) and the links between the nodes. I should point out that this design approach solves the question of integrity constraints, since one only describes links that actually exist. Pascal's approach also solves the case of the "child" that appears in the descent of numerous "parents." This case is quite common in the industry and yet so rare in textbooks, which usually stick to the employee/manager example. Pascal, following ideas of Chris Date, suggests that there should be an explode( ) operator to flatten, on the fly, a hierarchy, by providing a view which would make explicit the implicit links between nodes. The only snag is that this operator has never been implemented. DBMS vendors have quite often implemented specialized processes such as the handling of spatial data or full-text indexing, but the proper implementation of hierarchical data has oscillated between the nonexistent and the feeble, thus leaving most of the burden of implementation just where it doesn't belong: with the developer. As I have already hinted, the main difficulty when dealing with hierarchical data lies in walking the tree. Of course, if your aim is just to display a tree structure in a graphical user interface, each time the user clicks on a node to expand it, you have no particular problem: issuing a query that returns all the children of the node for which you pass the identifier as argument is a straightforward task. 7.1.2. Practical Examples of HierarchiesIn real life, you meet hierarchies very often, but the tasks applied to them are rarely simple. Here are just three examples of real-life problems involving hierarchies, from different industries:
It is important to understand that these hierarchical problems are indeed quite distinct in their fundamental characteristics. A task such as finding the location of a file in an archive means walking a tree from the bottom to the top (that is, from a position of high granularity to one of increasing aggregation), because you start from some single file reference, that will point you to the folder in which it is stored, where you will find the identification of a box, and so forth on up to the room in a building, and so on, thus determining the exact location of the file. Finding all the products that contain a given ingredient also happens to be a bottom-up walk, although in that case our number of starting points may be very highand we have to repeat the walk each time. By contrast, risk exposure analysis means, first, a top-down walk to find all investments, followed by computations on the way back up to the top. It is a kind of aggregation, only more complicated. In general, the number of levels in trees tends to be rather small. This is, in fact, the main beauty of trees and the reason why they can be efficiently searched. If the number of levels is fixed, the only thing we have to do is to join the table containing a tree with itself as many times as we have levels. Let's take the case of archives and say that the inventory table shows us in which folder our loan file is located. This folder identifier will take us to a location table, that points us the identifier of the box which contains the folder, the shelf upon which the box is laid, the cabinet to which the shelf belongs, the alley where we can find this cabinet, the room which contains the alley, the floor on which the room is located, and, finally, the building. If the location table treats folders, boxes, shelves, and the like as generic "locations," a query returning all the components in the physical location of a file might look like this: select building.name building, floor.name floor, room.name room, alley.name alley, cabinet.name cabinet, shelf.name shelf, box.name box, folder.name folder from inventory, location folder, location box, location shelf, location cabinet, location alley, location room, location floor, location building where inventory.id = 'AZE087564609' and inventory.folder = folder.id and folder.located_in = box.id and box.located_in = shelf.id and shelf.located_in = cabinet.id and cabinet.located_in = alley.id and alley.located_in = room.id and room.located_in = floor.id and floor.located_in = building.id This type of query, in spite of an impressive number of joins, should run fast since each successive join will use the unique index on location (that is, the index on id), presumably the primary key. But yes, there is a catch: the number of levels in a hierarchy is rarely constant. Even in the rather sedate world of archives, the contents of boxes are often moved after the passage of time to new containers (which may be more compact and, therefore, provide cheaper storage). Such activity may well replace two levels in a hierarchy with just one, as containers will replace both boxes and shelves. What should we do when we don't know the number of levels? How best do we query such a hierarchy? Do we use a union? An outer-join? Links between objects of the same nature should be modeled as trees as soon as the number of levels between two objects is no longer a constant. |