4.1. The Nature of SQLBefore we begin examining query constructs in detail, we need to review some of the general characteristics of SQL itself: how it relates to the database engine and the associated optimizer, and what may limit the efficiency of the optimizer. 4.1.1. SQL and DatabasesRelational databases owe their existence to pioneering work by E.F. Codd on the relational theory. From the outset, Codd's work provided a very strong mathematical basis to what had so far been a mostly empirical discipline. To make an analogy, for thousands of years mankind has built bridges to span rivers, but frequently these structures were grossly overengineered simply because the master builders of the time didn't fully understand the true relationships between the materials they used to build their bridges, and the consequent strengths of these bridges. Once the science of civil engineering developed a solid theoretical knowledge of material strengths, bridges of a far greater sophistication and safety began to emerge, demonstrating the full exploitation of the various construction materials being used. Indeed, the extraordinary dimensions of some modern bridges reflect the similarly huge increase in the data volumes that modern DBMS software is able to address. Relational theory has done for databases what civil engineering has done for bridges. It is very common to find confusion between the SQL language, databases, and the relational model. The function of a database is primarily to store data according to a model of the part of the real world from which that data has been obtained. Accordingly, a database must provide a solid infrastructure that will allow multiple users to make use of that same data, without, at any time, prejudicing the integrity of that data when they change it. This will require the database to handle contention between users and, in the extreme case, to keep the data consistent if the machine were to fail in mid-transaction. The database must also perform many other functions outside the scope of this book. As its name says, Structured Query Language, or SQL for short, is nothing other than a language, though admittedly with a very tight coupling to databases. Equating the SQL language with relational databases or even worse with the relational theoryis as misguided as assuming that familiarity with a spreadsheet program or a word processor is indicative of having mastered "information technology." In fact, some products that are not databases support SQL,[*] and before becoming a standard SQL had to compete against other languages such as RDO or QUEL, which were considered by many theorists to be superior to SQL.
Whenever you have to solve what I shall generically call an SQL problem, you must realize that there are two components in action: the SQL expression of the query and the database optimizer. These two components interact within three distinct zones, as shown in Figure 4-1. At the center lies the relational theory , where mathematicians freely roam. If we simplify excessively, we can say that (amongst other useful things) the theory informs us that we can retrieve data that satisfies some criteria by using a handful of relational operators, and that these operators will allow us to answer basically any question. Most importantly, because the relational theory is so firmly grounded in mathematics, we can be totally confident that relational expressions can be written in different ways and yet return the same result. In exactly the same way, arithmetic teaches us that 246/369 is exactly the same as 2/3. Figure 4-1. DBMS ProtagonistsHowever, despite the crucial theoretical importance of relational theory, there are aspects of great practical relevance that the relational theory has nothing to say about. These fall into an area I call "reporting requirements ." The most obvious example in this area is the ordering of result sets. Relational theory is concerned only with the retrieval of a correct data set, as defined by a query. As we are practitioners and not theorists, for us the relational phase consists in correctly identifying the rows that will belong to our final result set. The matter of how some attributes (columns) of one row relate to similar attributes in another row doesn't belong to this phase, and yet this is what ordering is all about. Further, relational theory has nothing to say about the numerous statistical functions (such as percentiles and the like) that often appear in various dialects of the SQL language. The relational theory operates on set, and knows nothing of the imposition of ordering on these sets. Despite the fact that there are many mathematical theories built around ordering, none have any relevance to the relational theory. At this stage I must point out that what distinguishes relational operations from what I have called reporting requirements is that relational operations apply to mathematical sets of theoretically infinite extent. Irrespective of whether we are operating on tables of 10, one million, or one billion rows, we can apply any filtering criterion in an identical fashion. Once again, we are concerned only with identifying and returning the data that matches our criteria. Here, we are in the environment where the relational theory is fully applicable. Now, when we want to order rows (or perform an operation such as group by that most people would consider a relational operation) we are no longer working on a potentially infinite data set, but on a necessarily finite set. The consequent data set thus ceases to be a relation in the mathematical sense of the word. We are outside the bounds of the relational theory. Of course, this doesn't mean that we cannot still do clever and useful things against this data using SQL. So we may, as a first approximation, represent an SQL query as a double-layered operation as shown in Figure 4-2; first, a relational core identifying the set of data we are going to operate on, second, a non-relational layer which works on this now finite set to give the polishing touch and produce the final result that the user expects. Figure 4-2. The various layers of an SQL queryDespite Figure 4-2's appealingly simple representation of the place of SQL within the data environment, an SQL query will in most cases be considerably more complex than Figure 4-2 may suggest; Figure 4-2 only represents the overall pattern. The relational filter may be a generic name for several independent filters combined, for instance, through a union construct or by the means of subqueries, and the complexity of some SQL constructs can be considerable. I shall come back to the topic of SQL code a little later. But first I must talk about the relationship between the physical implementation of data and the database optimizer. Do not confuse the true relational functionality of the SQL query execution with the additional presentation layer. 4.1.2. SQL and the OptimizerAn SQL engine that receives a query to process will have to use the optimizer to find out how to execute that query in the most efficient way possible. Here the relational theory strikes again, because that theory informs the optimizer of transformations that are valid equivalents of the semantically correct query initially provided by the developereven if that original query was clumsily written. Optimization is when the physical implementation of data comes into play. Depending on the existence of indexes and their usability in relation to a query, some transformations may result in much faster execution than other semantically equivalent transformations. Various storage models that I introduce in Chapter 5 may also make one particular way to execute a query irresistibly attractive. The optimizer examines the disposition of the indexes that are available, the physical layout of data, how much memory is available, and how many processors are available to be applied to the task of executing the query. The optimizer will also take into account information concerning the volume of the various tables and indexes that may be involved, directly or indirectly, through views used by the query. By weighing the alternatives that theory says are valid equivalents against the possibilities allowed by the implementation of the database, the optimizer will generate what is, hopefully, the best execution plan for the query. However, the key point to remember is that, although the optimizer may not always be totally weaponless in the non-relational layer of an SQL query, it is mainly in the relational core that it will be able to deploy its full powerprecisely because of the mathematical underpinnings of the relational theory. The transformation from one SQL query to another raises an important point: it reminds us that SQL is supposed to be a declarative language . In other words, one should use SQL to express what is required, rather than how that requirement is to be met. Going from what to how, should, in theory, be the work of the optimizer. You saw in Chapters 1 and 2 that SQL queries are only some of the variables in the equation; but even at the tactical query level, a poorly written query may prevent the optimizer from working efficiently. Remember, the mathematical basis of the relational theory provides an unassailable logic to the proceedings. Therefore, part of the art of SQL is to minimize the thickness, so to speak, of the non-relational layeroutside this layer, there is not much that the optimizer can safely do that guarantees returning exactly the same rows as the original query. Another part of the art of SQL is that when performing non-relational operationsloosely defined as operations for which the whole (at least at this stage) resulting dataset is knownwe must be extremely careful to operate on only the data that is strictly required to answer the original question, and nothing more. Somehow, a finite data set, as opposed to the current row, has to be stored somewhere, and storing anything in temporary storage (memory or disk) requires significant overhead due to byte-pushing. This overhead may dramatically increase as the result set data volumes themselves increase, particularly if main memory becomes unavailable. A shortage of main memory would initiate the high-resource activity of swapping to disk, with all its attendant overheads. Moreover, always remember that indexes refer to disk addresses, not temporary storageas soon as the data is in temporary storage, we must wave farewell to most fast access methods (with the possible exception of hashing). Some SQL dialects mislead users into believing that they are still in the relational world when they have long since left it. Take as a simple example the query "Who are the five top earners among employees who are not executives?"a reasonable real-life question, although one that includes a distinctly non-relational twist. Identifying employees who are not executives is the relational part of the query, from which we obtain a finite set of employees that we can order. Several SQL dialects allow one to limit the number of rows returned by adding a special clause to the select statement. It is then fairly obvious that both the ordering and the limitation criteria are outside the relational layer. However, other dialects, the Oracle version figuring prominently here, use other mechanisms. What Oracle has is a dummy column named rownum that applies a sequential numbering to the rows in the order in which they are returnedwhich means the numbering is applied during the relational phase. If we write something such as: select empname, salary from employees where status != 'EXECUTIVE' and rownum <= 5 order by salary desc we get an incorrect result, at least in the sense that we are not getting the top five most highly paid nonexecutives, as the query might suggest at first glance. Instead, we get back the first five nonexecutives foundthey could be the five lowest paid!ordered in descending order of salary. (This query illustrates a well-known trap among Oracle practitioners, who have all been burnt at least once.) Let's just be very clear about what is happening with the preceding query. The relational component of the query simply retrieves the first five rows (attributes empname and salary only) from the table employees where the employee is not an executive in a totally unpredictable order. Remember that relational theory tells us that a relation (and therefore the table that represents it) is not defined in any way by the order in which tuples (and therefore the rows in that table) are either stored or retrieved. As a consequence the nonexecutive employee with the highest salary may or may not be included in this result setand there is no way we will ever know whether this result set actually meets our search criteria correctly. What we really want is to get all nonexecutives, order them by decreasing salary, and only then get the top five in the set. We can achieve this objective as follows: select * from (select empname, salary from employees where status != 'EXECUTIVE' order by salary desc) where rownum <= 5 So, how is our query layered in this case? Many would be tempted to say that by applying a filtering condition to an ordered result, we end up with something looking more or less like Figure 4-3. Figure 4-3. A misleading view of what the "top five nonexecutives" query looks likeThe truth, however, is more like Figure 4-4. Using constructs that look relational doesn't take us back to the relational world, because to be in the relational world we must apply relational operators to relations. Our subquery uses an order by to sort the results. Once we've imposed ordering, we no longer have, strictly speaking, a relation (a relation is a set, and a set has no order). We end up with an outer select that looks relational on the surface but is applied to the output of an inline view in which a significant component (the order by clause) is not a relational process. Figure 4-4. What the "top five nonexecutives" query is really likeMy example of the top five nonexecutives is, of course, a simple example, but do understand that once we have left the relational sphere in the execution of a query, we can no longer return to it. The best we can possibly do is to use the output of such a query to feed into the relational phase of an outer query. For instance, "in which departments are our five top nonexecutive earners working?" What is extremely important to understand, though, is that at this stage no matter how clever the optimizer is, it will be absolutely unable to combine the queries, and will more or less have to execute them in a given sequence. Further, any resulting set from an intermediate query is likely to be held in temporary storage, whether in memory or on disk, where the choice of access methods may be reduced. Once outside the pure relational layer, the way we write a query is of paramount importance for performance because it will inevitably impose onto the query some execution path from which the SQL engine will not be able to stray. To summarize, we can say that the safest approach we can adopt is to try to do as much of the job as possible inside the relational layer, where the optimizer can operate to maximum efficiency. When the situation is such that a given SQL task is no longer a purely relational problem, then we must be particularly careful about the construct, or the writing of the query itself. Understanding that SQL has, like Dr. Jekyll, a double nature is the key to mastering the language. If you see SQL as a single-edged sword, then you are condemned to remain in the world of tips and tricks for dummies, smarties, and mere mortals, possibly useful for impressing the opposite sexalthough in my experience it doesn't work muchbut an approach that will never provide you with a deep understanding of how to cope with a difficult SQL problem. The optimizer rewards those who do the most work in the relational layer. 4.1.3. Limits of the OptimizerAny decent SQL engine relies heavily on its query optimizer, which very often performs an excellent job. However, there are many aspects of the way optimizers work that you must keep in mind:
Feed the optimizer with little chunks, and it will optimize little pieces. Feed it with a big chunk, and it will optimize a task. |