4.2. Five Factors Governing the Art of SQL

You have seen in the first part of this chapter exactly how SQL includes both relational and non-relational characteristics. You have also seen how this affects the efficient (and not-so-efficient) workings of the database optimizer. From this point forward, and bearing in mind the lessons of the first part of this chapter, we can concentrate on the key factors that must be considered when using SQL. In my view, there are five main factors:

  • The total quantity of data from which a result set has to be obtained

  • The criteria required to define the result set

  • The size of the result set

  • The number of tables to be processed in order to obtain the desired result set

  • The number of other users also modifying this same data

4.2.1. Total Quantity of Data

The volume of data we need to read is probably the most important factor to take into account; an execution plan that is perfectly suitable for a fourteen-row emp table and a four-row dept table may be entirely inappropriate for dealing with a 15 million-row financial_flows table against which we have to join a 5 million-row products table. Note that even a 15 million-row table will not be considered particularly large by the standards of many companies. As a consequence, it is hard to pronounce on the efficiency of a query before having run it against the target volume of data.

4.2.2. Criteria Defining the Result Set

When we write an SQL statement, in most cases it will involve filtering conditions located in where clauses, and we may have several where clauses (a major one as well as minor ones) in subqueries or views (regular views or in-line views). A filtering condition may be efficient or inefficient. However, the significance of efficient or inefficient is strongly affected by other factors, such as physical implementation (as discussed in Chapter 5) and, once again, by how much data we have to wade through.
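As a purely illustrative sketch (the orders and customers tables below are hypothetical, not taken from any example in this book), filtering conditions may be spread between the main where clause and the where clause of a subquery:

    -- The major filtering condition sits in the outer where clause;
    -- a minor one is buried in the where clause of the subquery.
    select o.order_id, o.amount
    from   orders o
    where  o.order_date >= date '2006-01-01'       -- major criterion
      and  o.customer_id in
              (select c.customer_id
               from   customers c
               where  c.country = 'FR');           -- minor criterion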

We need to approach the subject of defining the result in several parts, by considering filtering, the central SQL statements, and the impact of large data volumes on our queries. But this is a particularly complex area that needs to be treated in some depth, so I'll reserve this discussion until later in this chapter, in the major section entitled "Filtering."

4.2.3. Size of the Result Set

An important and often overlooked factor is how much data a query returns (or how much data a statement changes). This is often dependent on the size of the tables and the details of the filtering, but not in every case. Typically, several selection criteria that individually are of little help in narrowing down the data may result in highly efficient filtering when used in combination with one another. For example, retrieving students' names on the basis of whether they received a science or an arts degree will give a large result set; but if both criteria are combined (students who studied under both disciplines), the result set collapses to a tiny number of rows.
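To make the student example concrete, here is a hypothetical sketch (the students and degrees tables and their columns are assumptions, not schemas from this book):

    -- Either discipline on its own matches a large share of the students ...
    select s.student_name
    from   students s
    where  exists (select null
                   from   degrees d
                   where  d.student_id = s.student_id
                     and  d.discipline = 'SCIENCE')
      and  exists (select null
                   from   degrees d
                   where  d.student_id = s.student_id
                     and  d.discipline = 'ARTS');
    -- ... but demanding both at once collapses the result set to a handful of rows.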

In the case of queries in particular, the size of the result set matters not so much from a technical standpoint, but mostly because of the end user's perception. To a very large extent, end users adjust their patience to the number of rows they expect: when they ask for one needle, they pay little attention to the size of the haystack. The extreme case is a query that returns nothing, and a good developer should always try to write queries that return few or no rows as fast as possible. There are few experiences more frustrating than waiting for several minutes before finally seeing a "no data found" message. This is especially annoying if you have mistyped something, realized your error just after hitting Enter, and then have been unable to abort the query. End users are willing to wait to get a lot of data, but not to get an empty result. If we consider that each of our filtering criteria defines a particular result set, and that the final result set is either the intersection (when conditions are anded) or the union (when conditions are ored together) of all the intermediate result sets, then a zero result is most likely to arise from the intersection of small, intermediate result sets. In other words, the (relatively) most precise criteria are usually the primary reason for a zero result set. Whenever there is the slightest possibility that a query might return no data, the condition most likely to cause an empty return should be checked first, especially if it can be done quickly. Needless to say, the order of evaluation of criteria is extremely context-sensitive, as you shall see later under "Filtering."
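One possible way to apply this idea, sketched here with a hypothetical customers table and a hypothetical :entered_name bind variable, is to probe the most selective (and cheapest to check) criterion on its own before firing the full query:

    -- Quick probe on the criterion most likely to return nothing
    -- (a mistyped customer name, for instance).
    select count(*)
    from   customers
    where  customer_name = :entered_name;

    -- Only when the probe finds at least one row do we run the expensive
    -- report query that joins customers to orders, order_details, and so on;
    -- an empty result is then reported almost instantaneously.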

A skillful developer should aim for response times proportional to the number of rows returned.

4.2.4. Number of Tables

The number of tables involved in a query will naturally have some influence on performance. This is not because a DBMS engine performs joins badly; on the contrary, modern systems are able to join large numbers of tables very efficiently.

4.2.4.1. Joins

The perception of poor join performance is another enduring myth associated with relational databases. Folklore has it that one should not join too many tables, with five often suggested as the limit. In fact, you can quite easily have 15 table joins perform extremely well. But there are additional problems associated with joining a large number of tables, of which the following are examples:

  • When you routinely need to join, say, 15 tables, you can legitimately question the correctness of the design; keep in mind what I said in Chapter 1, that a row in a table states some kind of truth and can be compared to a mathematical axiom. By joining tables, we derive other truths. But there is a point at which we must decide whether something is an obvious truth that we can call an axiom, or whether it is a less obvious truth that we must derive. If we spend much of our time deriving our truths, perhaps our axioms are poorly chosen in the first place.

  • For the optimizer, the complexity increases exponentially as the number of tables increases. Once again, the excellent work usually performed by a statistical optimizer may comprise a significant part of the total response time for a query, particularly when the query is run for the first time. With large numbers of tables, it is quite impractical for the optimizer to explore all possible query paths. Unless a query is written in a way that eases the work of the optimizer, the more complex the query, the greater the chance that the optimizer will bet on the wrong horse.

  • When we write a complex query involving many tables, and when the joins can be written in several fairly distinct ways, the odds are high that we'll pick the wrong construct. If we join tables A to B to C to D, the optimizer may not have all the information it needs to recognize that A can be very efficiently joined directly to D, particularly if that join happens to be a special case. A sloppy developer trying to fix duplicate rows with a distinct can also easily overlook a missing join condition, as in the sketch that follows this list.
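The following contrived sketch (tables a, b, c, and d are hypothetical) shows how a forgotten join condition produces duplicate rows that a hasty distinct merely hides:

    -- Four tables, but only two join conditions: the link between b and c
    -- was forgotten, so the query builds a partial Cartesian product.
    select distinct a.id, d.label          -- distinct hides the duplicates ...
    from   a, b, c, d
    where  a.b_id = b.id
      and  c.d_id = d.id                   -- ... missing: and b.c_id = c.id
      and  a.status = 'OPEN';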

4.2.4.2. Complex queries and complex views

Be aware that the apparent number of tables involved in a query can be deceptive; some of the tables may actually be views, and sometimes pretty complex ones, too. Just as with queries, views can also have varying degrees of complexity. They can be used to mask columns, rows, or even a combination of rows and columns to all but a few privileged users. They can also be used as an alternate perspective on the data, building relations that are derived from the existing relations stored as tables. In cases such as these, a view can be considered shorthand for a query, and this is probably one of the most common usages of views. With increasingly complex queries, there is a temptation to break a query down into a succession of individual views, each representing a component of the greater query.
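For instance, a simple masking view might look like the following hypothetical sketch, which hides the salary column and restricts the rows to a single department:

    -- Hypothetical view masking both columns and rows of an employees table.
    create view my_department_staff as
    select employee_id, employee_name, job_title    -- salary deliberately omitted
    from   employees
    where  department_id = 42;                      -- only one department visible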

The simplicity of a given query may hide the complexity of participating views.

It would, of course, be an absurdly extreme position to banish views altogether; many of them are rather harmless animals. However, when a view is itself used in a rather complex query, in most cases we are only interested in a small fraction of the data returned by the view, possibly a couple of columns out of a score or more. The optimizer may attempt to merge a simple view back into the larger query statement. However, once a query reaches a relatively modest level of complexity, this approach may become too complex in itself to enable efficient processing.

In some cases a view may be written in a way that effectively prevents the optimizer from combining it into the larger statement. I have already mentioned rownums, those virtual columns used in Oracle to indicate the order in which rows are initially found. When rownums are used inside a view, a further level of complexity is introduced. Any attempt to combine a view that references a rownum into a larger statement would be almost guaranteed to change the subsequent rownum order, and therefore the optimizer doesn't permit a query rewrite in those circumstances. In a complicated query, such a view will necessarily be executed in isolation. In quite a number of cases then, the DBMS optimizer will push a view as is into a statement,[*] running it as a step in the statement execution, and using only those elements that are required from the result of the view execution.
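As a minimal Oracle-flavored sketch (the recent_orders view and the orders table are hypothetical), a rownum inside a view freezes the numbering and prevents the view from being merged into the calling query:

    -- rownum is assigned as rows come out of the inner, ordered query.
    create view recent_orders as
    select rownum as rank_in_view, order_id, amount
    from   (select order_id, amount
            from   orders
            order by order_date desc);

    -- Merging this view into the statement below would change the rownum values,
    -- so the view is executed as a separate step and filtered afterwards.
    select order_id, amount
    from   recent_orders
    where  rank_in_view <= 10;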

[*] The optimizer may also sometimes push criteria down into the view.

Frequently, many of the operations executed in a view (typically joins to return a description associated with codes) will be irrelevant in the context of a larger query, or a query may have special search criteria that would have been particularly selective when applied to the tables underlying the view. For instance, a subsequent union may prove to be totally unnecessary because the view is the union of several tables representing subtypes, and the larger query filters on only one of the subtypes. There is also the danger of joining a view with a table that itself appears in the same view, thus forcing multiple passes over this table and probably hitting the same rows several times when one pass would have been quite sufficient.
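Here is a hypothetical sketch of such a case: a view defined as the union of two subtype tables, queried with a filter that concerns only one of the subtypes:

    -- View built as the union of two hypothetical subtype tables.
    create view all_accounts as
    select account_id, balance, 'RETAIL'    as account_type from retail_accounts
    union all
    select account_id, balance, 'CORPORATE' as account_type from corporate_accounts;

    -- The query only wants retail accounts; unless the optimizer manages to prune
    -- the irrelevant branch, corporate_accounts is scanned for nothing.
    select account_id, balance
    from   all_accounts
    where  account_type = 'RETAIL'
      and  balance < 0;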

When a view returns much more data than required in the context of a query that references that view, dramatic performance gains can often be obtained by eliminating the view (or using a simpler version of the view). Begin by replacing the view reference in the main query with the underlying SQL query used to define the view. With the components of the view in full sight, it becomes easy to remove everything that is not strictly necessary. More often than not, it's precisely what isn't necessary that prevents the view from being merged by the optimizer, and a simpler, cut-down view may give excellent results. When the query is correctly reduced to its most basic components, it runs much faster.
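Continuing the hypothetical union-of-subtypes sketch above, decomposing the view into the main query body lets the unnecessary branch disappear entirely:

    -- Same result as querying the all_accounts view, but only the relevant
    -- subtype table is visited and the union has vanished.
    select account_id, balance
    from   retail_accounts
    where  balance < 0;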

Many developers may hesitate to push the code for a very complex view into an already complex query, not least because it can make a complex situation even more complicated. The exercise of developing and factoring a complex SQL expression may indeed appear to be daunting. It is, however, an exercise quite similar to the development of mathematical expressions, as practiced in high school. It is, in my view, a very formative exercise and well worth the effort of mastering. It is a discipline that provides a very sound understanding of the inner workings of a query for developers anxious to improve their skills, and in most cases the results can be highly rewarding.

Rather than embedding a view inside a query when that view returns unnecessary elements, try to decompose the view components into the main query body.

4.2.5. Number of Other Users

Finally, concurrency is a factor that you must carefully take into account when designing your SQL code. Concurrency is usually a concern when writing to the database, where block-access contention, locking, and latching (the locking of internal DBMS resources) are among the more obvious problem areas; even read consistency can in some cases lead to some degree of contention. Any server, no matter how impressive its specification, will always have a finite capacity. The ideal plan for a query running on a machine with little to no concurrency is not necessarily the same as the ideal plan for the same query running on the same machine with a high level of concurrency. Sorts may no longer find the memory they need and may instead resort to writing to disk, thus creating a new source of contention. Some CPU-intensive operations (for example, the computation of complicated functions, or repetitive scanning of index blocks) may cause the computer to overload. I have seen cases in which more physical I/Os resulted in a significantly better time to perform a given task: there was a high level of concurrency for CPU-intensive operations, and when some processes had to wait for I/Os, the overworked CPUs were relieved and could run other processes, thus ensuring a better overlap. We must often think in terms of the global throughput of a particular business task, rather than in terms of individual user response time.

Chapter 9 examines concurrency in greater detail.



