7.2. Representing Trees in an SQL DatabaseTrees are generally represented in the SQL world by one of three models:
There is a fourth, less well-known model, presented by its author, Vadim Tropashko, who calls it the nested interval model, in a very interesting series of papers.[*] The idea behind this model is, to put it very simply, to encode the path of a given node with two numbers, which are interpreted as the numerator and the denominator of a rational number (a fraction to those uncomfortable with the vocabulary of mathematics) instead of an interval. Unfortunately, this model is heavy on computations and stored procedures and, while it looks promising for a future implementation of good tree-handling functions (perhaps the explode( ) operator) in a DBMS, it is in practice somewhat difficult to implement and not the fastest you can do, which is why I shall focus on the three aforementioned models.
To keep in tone with our general theme, and to generate a reasonable amount of data, I have created a test database of the organizations of the various armies that were opposed in 1815 at the famous battle of Waterloo in Belgium, near Brussels[] (known as [] Using, with his permission, the data compiled by Peter Kessler, at http://www.kessler-web.co.uk. I must hasten to say that the point of what follows in this chapter is to demonstrate various ways to walk hierarchies and that the design of my tables is, to say the least, pretty slack. Typically, a proper primary key for a fighting unit should be an understandable and standardized code, not a description that may suffer from data entry errors. Please understand that any reference to a surrogate id is indeed shorthand for an implicit, sound primary key. The main difficulty with hierarchies is that there is no "best representation." When our interest is mostly confined to the ancestors of a few elements (a bottom-up walk), either connect by or the recursive with is, at least functionally and in terms of performance, sufficiently satisfactory. However, if we scratch the surface, connect by in particular is of course a somewhat ugly, non-relational, procedural implementation, in the sense that we can only move gradually from one row to the next one. It is much less satisfactory when we want to return either a bottom-up hierarchy for a very large number of items, or when we need to return a very large number of descendants in a top-down walk. As is so often the case with SQL, the ugliness that you can hide with a 14-row table becomes painfully obvious when you are dealing with millions, not to say billions, of rows, and that nice little SQL trick now shows its limits in terms of performance. My example table, which contains a little more than 800 rows, is a bit larger than the usual examples, although it is in no way comparable to what you can regularly find in the industry. However, it is big enough to point out the strengths and weaknesses of the various models. The SQL implementation of trees is DBMS dependent; use what your DBMS has to offer. |