10.1. Increasing Volumes

Some applications see the volume of data they handle increase considerably over time. In particular, any application that requires keeping several months or even years of mostly inactive data online, for regulatory or business analysis purposes, often passes through phases of crisis when (mostly) batch programs tend to overshoot the time allocated to them and interfere with regular, human activity.

When you start a new project, the volume of data usually changes as shown in Figure 10-1. Initially, hardly anything other than a relatively small amount of reference data is loaded into the database. As a new system replaces an older one, data inherited from the legacy system is painfully loaded into the new one. First, because of the radical rethink of the application, conversion from the old system to the new system is fraught with difficulties. When deadlines have to be met and some noncritical tasks have to be postponed, the recovery of legacy data is a prime candidate for slipping behind schedule. As a result, this recovery goes on for some time after the system has become operational and teething problems have been solved. Second, the old system is usually much poorer from a functional perspective than the new one (otherwise, the cost of the new project would have made for difficult acceptance up the food chain). All this means that the volume of prehistoric data will be rather small compared to the data handled by the new system, and several months' worth of old data will probably be equivalent to a few weeks of new data at most.

Meanwhile, operational data accumulates. Usually, one encounters the first serious performance issues about midway to the volume that the database is expected to hold at cruising speed. Bad queries and bad algorithms are almost invisible, from an end-user perspective, when volumes are low or moderate.
The raw power of hardware often hides gigantic mistakes and may give comfortable sub-second response times for full scans of tables that contain several hundreds of thousands of rows. You may be seriously misusing the hardware, balancing gross programming mistakes with power, but nobody will see that until your volume becomes respectable. At the first crisis point of the project, "expert tuning" is usually required to add a couple of indexes that should have been there from the start. The system then wobbles until it reaches the target volume.

Figure 10-1. The evolution of data in a new application

There are usually two target volumes: a nominal one (which has been grossly overestimated and which is the volume the system has officially been designed to manage) and the real target volume (which the system just manages to handle and which is often exceeded at some point because archiving of older data has been relegated to the very last lot in the project). The second and more serious crisis often comes in the wake of reaching that point. When archival has finally been put into production, architectural weaknesses reviewed, and some critical processes vigorously rewritten, the system finally reaches cruising speed, with an increase of data related to the natural growth of business, a growth that can lie anywhere between flatness and exponential exuberance.

This presentation of the early months in the life of a new database application is partly caricature; but it probably bears more resemblance to reality than it should, because the simple mistakes that lead to this caricature are not often avoided. However rigorously one tries to work, errors are made, because of pressure, lack of time for adequate testing, and ambiguous specifications. The only errors that can bring irredeemable failure are those linked to the design of the database and to the choice of the global architecture, two topics that are closely related and that are the foundation of a system.
If the foundation is not sturdy enough, you need to pull the whole building down before reconstructing it. Other mistakes may require a more or less deep overhaul of what is in place. Most crises, however, need not happen. You must anticipate volume increases when coding, and you must quickly identify and rewrite any query whose performance deteriorates too quickly in the face of increasing data volumes.

10.1.1. Sensitivity of Operations to Volume Increases

Not all SQL operations are equally susceptible to variations in performance when the number of rows to be processed increases. Some SQL operations are insensitive to volume increases, some see performance decrease linearly with volume, and some perform very badly with large volumes of data.

10.1.1.1. Insensitivity to volume increase

Typically, there will be no noticeable difference in a search on the primary key, whether you are looking for one particular key among 1,000 or one among 1,000,000. The common B-tree indexes are rather flat and efficient structures, and the size of the underlying table doesn't matter for a single-row, primary-key search. But insensitivity to volume increase doesn't mean that single primary-key searches are the ultimate SQL search method. When you are looking for a large number of rows, the "transactional" single-row operation can be significantly inefficient. Just consider the following, somewhat artificial, Oracle examples, each operating on the same range of values of a sequence-generated primary key:

    SQL> declare
      2      n_id number;
      3      cursor c is select customer_id
      4                  from orders
      5                  where order_id between 10000 and 20000;
      6  begin
      7    open c;
      8    loop
      9      fetch c into n_id;
     10      exit when c%notfound;
     11    end loop;
     12    close c;
     13  end;
     14  /

    PL/SQL procedure successfully completed.

    Elapsed: 00:00:00.27

    SQL> declare
      2      n_id number;
      3  begin
      4    for i in 10000 .. 20000
      5    loop
      6      select customer_id
      7      into n_id
      8      from orders
      9      where order_id = i;
     10    end loop;
     11  end;
     12  /

    PL/SQL procedure successfully completed.
    Elapsed: 00:00:00.63

The cursor in the first example, which does an explicit range scan, runs twice as fast as the iteration over single rows. Why? There are multiple technical reasons ("soft parsing," a fast acknowledgement at each iteration that the DBMS engine has already met this statement and knows how to execute it, is one of them), but the single most important one is that in the first example the B-tree is descended once, and then the ordered list of keys is scanned and row addresses found and used to access the table; while in the second example, the B-tree is descended anew for each searched value of the order_id column. The most efficient way to process a large number of rows is not to iterate and apply the single-row process.

10.1.1.2. Linear sensitivity to volume increases

End users usually understand well that a query returning twice as many rows will take more time to run; but many SQL operations double in time when operating on double the number of rows without the underlying work being as obvious to the end user as in the case of a full table scan returning rows one after the other. Consider the case of aggregate functions: if you compute a max( ), that aggregation will always return a single row, but the number of rows the DBMS has to operate on may vary wildly over the life of the application. This is perfectly understandable, but end users will always see a single row returned, so they may complain of performance degradation over time. The only way to ensure that the situation will not go from bad to worse is to put an upper bound on the number of rows processed by using another criterion, such as a date range. Placing an upper bound keeps data volumes under control. In the case of max( ), the idea might be to look for the maximum since a given date, and not necessarily since the beginning of time.
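To make the bounded-aggregate idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in DBMS; the table, dates, and cutoff are invented for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer primary key, order_date text)")
# Invented data: a thousand orders spread over the months of one year.
conn.executemany("insert into orders values (?, ?)",
                 [(i, "2006-%02d-01" % (i % 12 + 1)) for i in range(1, 1001)])

# Unbounded: the aggregate must consider the whole, ever-growing history.
unbounded = conn.execute("select max(order_date) from orders").fetchone()[0]

# Bounded: a date criterion caps the number of rows the aggregate has to visit,
# whatever the total size of the table becomes.
bounded = conn.execute(
    "select max(order_date) from orders where order_date >= '2006-06-01'"
).fetchone()[0]
print(unbounded, bounded)
```

Both queries return the same single row to the end user; the difference is that the bounded one keeps the volume of data it processes under control as the table grows.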
Adding a criterion to a query is not a simple technical issue and definitely depends on business requirements, but limiting the scope of queries is an option that certainly deserves to be pointed out to, and debated with, the people who draft specifications.

10.1.1.3. Non-linear sensitivity to volume increases

Operations that perform sorts suffer more from volume increases than operations that just perform a scan, because sorts are complex and require on average a little more than a single pass. Sorting 100 randomly ordered rows is not 10 times costlier than sorting 10 rows, but about 20 times costlier, and sorting 1,000 rows is, on average, something like 300 times costlier than sorting 10 rows. In real life, however, rows are rarely randomly stored, even when techniques such as clustering indexes (Chapter 5) are not used. A DBMS can sometimes use sorted indexes to retrieve rows in the expected order instead of sorting them after having fetched them, and the performance degradation resulting from retrieving a larger sorted set, although real, is rarely shocking. Be careful, though: performance degradation from sorts often proceeds by fits and starts, because smaller sorts will be fully executed in memory, while larger sorts will result from the merge of several sorted subsets that have each been processed in memory before being written to temporary storage. There are, therefore, some "dangerous surroundings" where one switches from a relatively fast full-memory mode to a much slower memory-plus-temporary-storage mode. Adjusting the amount of memory allocated to sorts is a frequent and efficient tuning technique to improve sort-heavy operations when flirting with the dangerous limit.

By way of example, Figure 10-2 shows how the fetch rate (number of rows fetched per unit of time) of a number of queries evolves as a table grows.
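Incidentally, the 20-fold and 300-fold figures quoted above are consistent with the classic n·log2(n) cost estimate for comparison sorts on randomly ordered input, which is easy to check:

```python
import math

def sort_cost(n):
    # Comparison sorts perform on the order of n * log2(n) comparisons
    # on randomly ordered input.
    return n * math.log2(n)

# Cost of sorting 100 and 1,000 rows, relative to sorting 10 rows.
print(sort_cost(100) / sort_cost(10))
print(sort_cost(1000) / sort_cost(10))
```

The two ratios come out at 20 and 300, matching the factors given in the text.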
The table used in the test is a very simple orders table, defined as follows:

    order_id         bigint(20)    (primary key)
    customer_id      bigint(20)
    order_date       datetime
    order_shipping   char(1)
    order_comment    varchar(50)

The queries are first a simple primary-key-based search:

    select order_date
    from orders
    where order_id = ?

then a simple sort:

    select customer_id
    from orders
    order by order_date

then a grouping:

    select customer_id, count(*)
    from orders
    group by customer_id
    having count(*) > 3

then the selection of the maximum value in a nonindexed column:

    select max(order_date)
    from orders

and finally, the selection of the "top 5" customers by number of orders:

    select customer_id
    from (select customer_id, count(*)
          from orders
          group by customer_id
          order by 2 desc) as sorted_customers
    limit 5

(SQL Server would replace the closing limit 5 with an opening select top 5, and Oracle would replace limit 5 with where rownum <= 5.)

The number of rows in the table varied between 8,000 and around 1,000,000, while the number of distinct customer_id values remained constant at about 3,000. As you can see in Figure 10-2, the primary key search performs almost as well with one million rows as with 8,000. There seems to be some very slight degradation at the higher number, but the query is so fast that the degradation is hardly noticeable. By contrast, the sort suffers. The performance (measured in rows returned per unit of time, and therefore independent of the actual number of rows fetched) of the sorting query decreases by 40% when the number of rows goes from 8,000 to over one million. The degradation of performance, though, is even more noticeable for the queries that, while returning the very same number of aggregated rows, have a great many more rows to visit to obtain the relatively few rows to be returned. These are typically the type of queries that are going to draw the most complaints from end users.
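For readers who want to replay the "top n" pattern, here is a scaled-down sketch with Python's sqlite3 module (SQLite shares MySQL's limit syntax); the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (order_id integer primary key, customer_id integer)")
# Invented data: customer 1 places five orders, customer 2 three, customer 3 one.
conn.executemany("insert into orders values (?, ?)",
                 [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1),
                  (6, 2), (7, 2), (8, 2),
                  (9, 3)])

# "Top 2" customers by number of orders: aggregate, sort, then cut.
top_customers = [r[0] for r in conn.execute("""
    select customer_id
    from (select customer_id, count(*) as order_count
          from orders
          group by customer_id
          order by order_count desc) as sorted_customers
    limit 2""")]
print(top_customers)
```

The query returns the same two rows however many orders accumulate, but the grouping and the sort have to visit every row of the table, which is exactly why this family of queries degrades as volume grows.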
Note that the DBMS doesn't perform that badly: the performance decrease is very close to proportional to the number of rows, even for the two queries that require a sort (the queries labeled "Group by" and "Top" in Figure 10-2). But end users simply see the same amount of data, just returned much more slowly.

Figure 10-2. How some simple queries behave when the queried table grows

Not all database operations are equally sensitive to volume increases. Anticipate how queries will perform on target volumes.

10.1.1.4. Putting it all together

The main difficulty in estimating how a query will behave when data volumes increase is that high sensitivity to volume may be hidden deep inside the query. Typically, a query that finds the "current value" of an item by running a subquery that looks for the last time the price was changed, and then performs a max( ) over the price history, is highly sensitive. If we accumulate a large number of price changes, we shall probably suffer a slow performance degradation of the subquery, and by extension of the outer query as well. The degradation will be much less noticeable with an uncorrelated subquery, executed only once, than with a correlated subquery that compounds the effect by being fired each time it is evaluated. Such degradation may be barely perceptible in a single-item operation, but will be much more so in batch programs.
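A minimal sketch of that "current value" pattern, against a hypothetical price_history table (SQLite syntax via Python's sqlite3; names and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table price_history
                (item_id integer, change_date text, price real,
                 primary key (item_id, change_date))""")
conn.executemany("insert into price_history values (?, ?, ?)",
                 [(1, "2006-01-01", 9.90),
                  (1, "2006-03-01", 10.50),
                  (2, "2006-02-01", 4.25)])

# Correlated form: the inner max() is re-evaluated for every candidate row,
# and each evaluation has to inspect the item's full (and growing) history.
current = conn.execute("""
    select p.item_id, p.price
      from price_history p
     where p.change_date = (select max(p2.change_date)
                              from price_history p2
                             where p2.item_id = p.item_id)
     order by p.item_id""").fetchall()
print(current)
```

The result is tiny and stable (one "current" price per item), but the work behind it grows with the history, and in a batch that fires the pattern for every item the degradation compounds.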
Another issue is sorts. We have seen that an increase in the number of rows sorted leads to quite perceptible degradation of performance. Actually, what matters is not so much the number of rows proper as the number of bytes; in other words, the total amount of data to be sorted. This is why joins with what is mostly informational data, such as user-friendly labels associated with an obscure code (as opposed to the data involved in the filtering conditions driving the query), should be postponed to the very last stage of a query.

Let's take a simple example showing why some joins should be delayed until the end of a query. Getting the names and addresses of our 10 biggest customers for the past year will require joining the orders and order_detail tables to get the amount ordered by each customer, and joining to a customers table to get each customer's details. If we only want our 10 biggest customers, we must get everybody who has bought something from us in the past year, sort them by decreasing amount, and then limit the output to the first ten resulting rows. If we join all the information from the start, we will have to sort the names and addresses of all our customers from the past year. We don't need to operate on such a large amount of data. What we must do is keep the amount of data to be sorted to the strict minimum: the customer identifier and the amount. Once everything is sorted, we can join the 10 customer_ids we are left with to the customers table to return all the information that is required.
In other words, we mustn't write something like:

    select *
    from (select c.customer_name,
                 c.customer_address,
                 c.customer_postal_code,
                 c.customer_state,
                 c.customer_country,
                 sum(d.amount)
          from customers c,
               orders o,
               order_detail d
          where c.customer_id = o.customer_id
            and o.order_date >= some date expression
            and o.order_id = d.order_id
          group by c.customer_name,
                   c.customer_address,
                   c.customer_postal_code,
                   c.customer_state,
                   c.customer_country
          order by 6 desc) as A
    limit 10

but rather something like:

    select c.customer_name,
           c.customer_address,
           c.customer_postal_code,
           c.customer_state,
           c.customer_country,
           b.amount
    from (select a.customer_id,
                 a.amount
          from (select o.customer_id,
                       sum(d.amount) as amount
                from orders o,
                     order_detail d
                where o.order_date >= some date expression
                  and o.order_id = d.order_id
                group by o.customer_id
                order by 2 desc) as a
          limit 10) as b,
         customers c
    where c.customer_id = b.customer_id
    order by b.amount desc

The second sort is a safeguard, in case the join modifies the order of the rows resulting from the inner subquery (remember that relational theory knows nothing about sorts, and that the DBMS engine is perfectly entitled to process the join in whatever way the optimizer finds most efficient). We have two sorts instead of one, but the inner sort operates on "narrower" rows, while the outer one operates on only 10 rows. Remember what was said in Chapter 4: we must limit the "thickness" of the non-relational layer of SQL queries. The thickness depends on the number and complexity of operations, but also on the amount of data involved. Since sorts and limits of all kinds are non-relational operations, the optimizer will probably not rewrite a query to execute a join after having cut the number of customer identifiers to the bare minimum. Although an attentive reading of the two queries may make it obvious that they return the same result, mathematically proving that they always return the same result borders on mission impossible.
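On a toy dataset (invented schema and values, SQLite as a stand-in, and one level of nesting collapsed since SQLite lets limit sit next to order by), one can at least check that the two shapes agree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customers (customer_id integer primary key, customer_name text);
create table orders (order_id integer primary key, customer_id integer);
create table order_detail (order_id integer, amount real);
""")
conn.executemany("insert into customers values (?, ?)",
                 [(i, "customer %d" % i) for i in range(1, 21)])
# Invented data: customer i places i orders of 100 each, so the
# biggest customer_ids are the biggest spenders.
oid = 0
for c in range(1, 21):
    for _ in range(c):
        oid += 1
        conn.execute("insert into orders values (?, ?)", (oid, c))
        conn.execute("insert into order_detail values (?, 100.0)", (oid,))

# "Fat" shape: sorts the wide rows (here just names, but picture addresses too).
fat_sort = [r[0] for r in conn.execute("""
    select c.customer_name, sum(d.amount) as amount
    from customers c, orders o, order_detail d
    where c.customer_id = o.customer_id
      and o.order_id = d.order_id
    group by c.customer_id, c.customer_name
    order by amount desc
    limit 3""")]

# "Narrow" shape: sort only (customer_id, amount), cut to 3, then join.
narrow_sort = [r[0] for r in conn.execute("""
    select c.customer_name, b.amount
    from (select o.customer_id, sum(d.amount) as amount
          from orders o, order_detail d
          where o.order_id = d.order_id
          group by o.customer_id
          order by amount desc
          limit 3) as b,
         customers c
    where c.customer_id = b.customer_id
    order by b.amount desc""")]
print(fat_sort, narrow_sort)
```

Same three customers either way; the difference, invisible at this size, is how many bytes pass through the sort before the limit is applied.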
An optimizer always plays it safe; a DBMS cannot afford to return wrong results by attempting daring rewrites, especially since it knows hardly anything about semantics. Our example is therefore a case in which the optimizer will limit its action to performing the joins inside the inner queries as efficiently as possible. But ordering and aggregates put a stop to the mingling of inner and outer queries, and therefore the query will for the most part run as it is written. The query that performs the sort of amounts before the joins is, no doubt, very ugly. But this ugly SQL code is the way to write it, because it is the way the SQL engine should execute it if we want resilience to a strong increase in the number of customers and orders.

To reduce the sensitivity of your queries to increases in the volume of data, operate only on the data that is strictly necessary at the deeper levels of a query. Keep ancillary joins for the outer level.

10.1.1.5. Disentangling subqueries

As I have said more than once, correlated subqueries must be fired for each row that requires their evaluation. They are often a major issue when volume increases transform a few shots into sustained rounds of fire. In this section, a real-life example will illustrate both how ill-used correlated subqueries can bog a process down and how one can attempt to save such a situation. The issue at hand, in an Oracle context, is a query that belongs to an hourly batch run to update a security management table. Note that this mechanism is already in itself a fudge to speed up security clearance checks on the system in question. Over time, the process takes more and more time, until reaching, on the production server, 15 minutes, which for an hourly process that suspends application availability is a bit too much. The situation sends all bells ringing and all whistles blowing. Red alert!
The slowness of the process has been narrowed down to the following statement:

    insert /*+ append */ into fast_scrty
           (emplid, rowsecclass, access_cd, empl_rcd, name,
            last_name_srch, setid_dept, deptid, name_ac,
            per_status, scrty_ovrd_type)
    select distinct emplid, rowsecclass, access_cd, empl_rcd, name,
           last_name_srch, setid_dept, deptid, name_ac,
           per_status, 'N'
    from pers_search_fast

Statistics are up to date, so we must focus our attack on the query. As it happens, the ill-named pers_search_fast is a view defined by the following query:

     1  select a.emplid,
     2         sec.rowsecclass,
     3         sec.access_cd,
     4         job.empl_rcd,
     5         b.name,
     6         b.last_name_srch,
     7         job.setid_dept,
     8         job.deptid,
     9         b.name_ac,
    10         a.per_status
    11  from person a,
    12       person_name b,
    13       job,
    14       scrty_tbl_dept sec
    15  where a.emplid = b.emplid
    16    and b.emplid = job.emplid
    17    and (job.effdt =
    18           (select max(job2.effdt)
    19            from job job2
    20            where job.emplid = job2.emplid
    21              and job.empl_rcd = job2.empl_rcd
    22              and job2.effdt <= to_date(to_char(sysdate,
    23                        'YYYY-MM-DD'),'YYYY-MM-DD'))
    24    and job.effseq =
    25           (select max(job3.effseq)
    26            from job job3
    27            where job.emplid = job3.emplid
    28              and job.empl_rcd = job3.empl_rcd
    29              and job.effdt = job3.effdt))
    30    and sec.access_cd = 'Y'
    31    and exists
    32           (select 'X'
    33            from treenode tn
    34            where tn.setid = sec.setid
    35              and tn.setid = job.setid_dept
    36              and tn.tree_name = 'DEPT_SECURITY'
    37              and tn.effdt = sec.tree_effdt
    38              and tn.tree_node = job.deptid
    39              and tn.tree_node_num between sec.tree_node_num
    40                                       and sec.tree_node_num_end
    41              and not exists
    42                     (select 'X'
    43                      from scrty_tbl_dept sec2
    44                      where sec.rowsecclass = sec2.rowsecclass
    45                        and sec.setid = sec2.setid
    46                        and sec.tree_node_num <> sec2.tree_node_num
    47                        and tn.tree_node_num between sec2.tree_node_num
    48                                                 and sec2.tree_node_num_end
    49                        and sec2.tree_node_num between sec.tree_node_num
    50                                                   and sec.tree_node_num_end))

This type of "query of death" is, of course, too complicated for us to understand at a glance!
As an exercise, though, it would be interesting for you to pause at this point, consider the query carefully, try to broadly define its characteristics, and try to identify possible performance stumbling blocks. If you are done pondering the query, let's compare notes: there are a number of interesting patterns that you may have noticed.
One of the very first things to do is to try to get an idea of the volumes involved; person_name is a view, but querying it indicates no performance issue. The data dictionary tells us how many rows we have:

    TABLE_NAME                       NUM_ROWS
    ------------------------------ ----------
    TREENODE                           107831
    JOB                                 67660
    PERSON                              13884
    SCRTY_TBL_DEPT                        568

None of these tables is really large; it is interesting to notice that one need not deal with hundreds of millions of rows to perceive a significant degradation of performance as tables grow. The killing factor is how we are visiting the tables. Finding out on the development server (obviously not as fast as the server used in production) how many rows are returned by the view is not very difficult, but requires nerves of steel:

    SQL> select count(*) from PERS_SEARCH_FAST;

      COUNT(*)
    ----------
        264185

    Elapsed: 01:35:36.88

A quick look at the indexes shows that both treenode and job are over-indexed, a common flaw of COTS packages. We do not have here a case of the "obviously missing index." Where must we look to find the reason the query is so slow? We should look mostly at the lethal combination of a reasonably large number of rows and of correlated subqueries. The cascading exists/not exists, in particular, is probably what does us in.
Take a closer look at the exists/not exists expression. The first-level subquery introduces the table treenode. The second-level subquery again hits the table scrty_tbl_dept, already present in the outer query, and compares it both to the current row of the first-level subquery (lines 47 and 48) and to the current row of the outer query (lines 44, 45, 46, 49, and 50)! If we want to get tolerable performance, we absolutely must disentangle these queries.
Dreadful jargon, especially when one has not the slightest idea of what the data is about. Can we try to express the same thing in a more intelligible way, in the hope that it will lead us to more intelligible and efficient SQL? The key point is probably in the "there is no other node" part. If there is no descendent node, then we are at the bottom of the tree for the node identified by the value of tree_node_num in treenode. The subqueries in the initial view text are hopelessly mingled with the outer queries. But we can write a single self-contained query that "forgets" for the time being about the link between treenode and job and computes, for every node of interest in scrty_tbl_dept (a small table, under 600 rows), the number of children that match it on setid and rowsecclass:

    select s1.rowsecclass,
           s1.setid,
           s1.tree_node_num,
           tn.tree_node,
           count(*) - 1 children
    from scrty_tbl_dept s1,
         scrty_tbl_dept s2,
         treenode tn
    where s1.rowsecclass = s2.rowsecclass
      and s1.setid = s2.setid
      and s1.access_cd = 'Y'
      and tn.tree_name = 'DEPT_SECURITY'
      and tn.setid = s1.setid
      and tn.effdt = s1.tree_effdt
      and s2.tree_node_num between s1.tree_node_num
                               and s1.tree_node_num_end
      and tn.tree_node_num between s2.tree_node_num
                               and s2.tree_node_num_end
    group by s1.rowsecclass,
             s1.setid,
             s1.tree_node_num,
             tn.tree_node

(The count(*) - 1 is there so as not to count the current row.) The resulting set will, of course, be small: at most a few hundred rows. We shall filter out nodes that are not leaf nodes (in our context) by using the preceding query as an inline view, and applying a filter:

    and children = 0

From here, and only from here, we can join to job and properly determine the final set. Giving the final text of the view would not be extremely interesting.
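The nested-range logic may be easier to grasp on a toy dataset. The following sketch (Python's sqlite3, entirely invented values) reduces the problem to one security class with two nested clearance ranges, [1, 100] and [10, 20], and two department nodes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table scrty_tbl_dept
   (rowsecclass text, setid text, access_cd text, tree_effdt text,
    tree_node_num integer, tree_node_num_end integer);
create table treenode
   (setid text, tree_name text, effdt text, tree_node text,
    tree_node_num integer);
""")
# Two nested clearance ranges for the same class: [1, 100] and [10, 20].
conn.executemany("insert into scrty_tbl_dept values (?, ?, ?, ?, ?, ?)",
                 [("RC1", "S1", "Y", "2006-01-01", 1, 100),
                  ("RC1", "S1", "Y", "2006-01-01", 10, 20)])
# Node D015 sits inside both ranges; node D050 only inside [1, 100].
conn.executemany("insert into treenode values (?, ?, ?, ?, ?)",
                 [("S1", "DEPT_SECURITY", "2006-01-01", "D015", 15),
                  ("S1", "DEPT_SECURITY", "2006-01-01", "D050", 50)])

leaves = conn.execute("""
    select rowsecclass, setid, tree_node_num, tree_node
    from (select s1.rowsecclass, s1.setid, s1.tree_node_num,
                 tn.tree_node, count(*) - 1 as children
          from scrty_tbl_dept s1, scrty_tbl_dept s2, treenode tn
          where s1.rowsecclass = s2.rowsecclass
            and s1.setid = s2.setid
            and s1.access_cd = 'Y'
            and tn.tree_name = 'DEPT_SECURITY'
            and tn.setid = s1.setid
            and tn.effdt = s1.tree_effdt
            and s2.tree_node_num between s1.tree_node_num
                                     and s1.tree_node_num_end
            and tn.tree_node_num between s2.tree_node_num
                                     and s2.tree_node_num_end
          group by s1.rowsecclass, s1.setid, s1.tree_node_num, tn.tree_node)
    where children = 0
    order by tree_node_num""").fetchall()
print(leaves)
```

For each department node, only the innermost clearance range containing it survives the children = 0 filter: D015 is matched by range [10, 20], D050 by range [1, 100].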
Let's just point out that the first succession of subqueries:

    and (job.effdt =
           (select max(job2.effdt)
            from job job2
            where job.emplid = job2.emplid
              and job.empl_rcd = job2.empl_rcd
              and job2.effdt <= to_date(to_char(sysdate,'YYYY-MM-DD'),
                                        'YYYY-MM-DD'))
      and job.effseq =
           (select max(job3.effseq)
            from job job3
            where job.emplid = job3.emplid
              and job.empl_rcd = job3.empl_rcd
              and job.effdt = job3.effdt))

is meant to find, for the most recent effdt for the current (emplid, empl_rcd) pair, the row with the highest effseq value. This condition is not, particularly in comparison to the other nested subquery, so terrible. Nevertheless, OLAP functions (or should we say analytical functions, since we are in an Oracle context?) can handle this type of "top of the top" case slightly more efficiently, when they are available. A query such as:

    select emplid, empl_rcd, effdt, effseq
    from (select emplid, empl_rcd, effdt, effseq,
                 row_number() over (partition by emplid, empl_rcd
                                    order by effdt desc, effseq desc) rn
          from job
          where effdt <= to_date(to_char(sysdate,'YYYY-MM-DD'),'YYYY-MM-DD'))
    where rn = 1

will easily select the (emplid, empl_rcd) values that we are really interested in, and can easily be reinjected into the main query as an inline view that will be joined to the rest. In real life, after this query was rewritten, the hourly process that had been constantly lengthening fell from 15 minutes to under 2.

Minimize the dependencies of correlated subqueries on elements from outer queries.

10.1.2. Partitioning to the Rescue

When the number of rows to process is on the increase, index searches that work wonders on relatively small volumes become progressively inefficient. A typical primary key search requires the DBMS engine to visit 3 or 4 pages while descending the index, and then the DBMS must visit the table page. A range scan will be rather efficient, especially when applied to a clustering index that constrains the table rows to be stored in the same order as the index keys.
Nevertheless, there is a point at which the constant to-and-fro between index pages and table pages becomes costlier than a plain linear search of the table. Such a linear search can take advantage of parallelism and of the read-ahead facilities made available by the underlying operating system and hardware. Index searches, which rely on key comparisons, are more sequential by nature. Large numbers of rows to inspect exemplify the case when accesses should be thought of in terms of sweeping scans, not isolated incursions, and joins performed through hashes or merges, not loops (all this was discussed in Chapter 6). Table scans are all the more efficient when the ratio of rows that belong to the result set to rows inspected is high. If we can split our table, using the data-driven partitioning introduced in Chapter 5, in such a way that our search criteria operate on a well-defined physical subset of the table, we maximize scan efficiency. In such a context, operations on a large range of values are much more efficient when applied brutishly to a well-isolated part of a table than when the boundaries have to be checked with the help of an index. Of course, data-driven partitioning doesn't miraculously solve all volume issues. For instance, a condition such as:
    where date_column_1 >= some value
      and date_column_2 <= some other value

will not enable us to use either partitioning or indexing any more efficiently than if only one of the conditions were specified. It is by specifying a between (or any semantic equivalent) encompassing a small number of partitions that we shall make the best use of partitioning.

Half-bounded conditions make poor use of both indexes and partitions.

10.1.3. Data Purges

Archival and data purges are too often considered ancillary matters, until they are seen as the very last hope for retrieving those by-and-large satisfactory response times of six months ago. Actually, they are extremely sensitive operations that, poorly handled, can put much strain on a system and contribute to pushing a precarious situation closer to implosion. The ideal case is when tables are partitioned (true partitioning or a partitioned view) and when archival and purges operate on partitions. If partitions can be simply detached, in one way or another, then an archival (or purge) operation is trivial: a partition is archived and a new, empty one possibly created. If not, we are still in a relatively strong position: the query that selects rows for archival will be a simple one, and afterwards it will be possible to truncate a partition, truncate being a way of emptying a table or partition that bypasses most of the usual mechanisms and is therefore much faster than regular deletes.
The less-than-ideal, but oh-so-common, case is when archival is triggered by age and other conditions. Accountants, for instance, are often reluctant to archive unpaid invoices, even rather old ones. This makes the rather simple and elegant partition shuffling or truncation look too crude. Must we fall back on the dull-but-trusted delete? It is interesting at this point to try to rank the data manipulation operations (inserts, updates, and deletes) in terms of overall cost. We have seen that inserts are pretty costly, in large part because when you insert a new row, all indexes on the table have to be maintained. Updates require only the maintenance of indexes on the updated columns. Their weakness, compared to inserts, is twofold: first, they are associated with a search (a where clause) that can be as disastrous as with a select, with the aggravating circumstance that in the meanwhile locks are held. Second, the previous value, nonexistent in the case of an insert, must be saved somewhere so as to be available in case of rollback. Deletes combine all the shortcomings: they affect all indexes, are usually associated with a where clause that can be slow, and need to save the deleted values for a possible transaction rollback. Of all the operations that change data, deletes offer the greatest potential for trouble. If we can therefore save on deletes, even at the price of other operations, we are likely to end up on the winning side. When a table is partitioned and archival and purge depend mostly on a date condition with strings attached, we can consider a three-stage purge:
Without partitioning, the situation is much more difficult. In order to limit lock duration, and assuming of course that once a row has attained the "ready for archival" state, no operation whatsoever can put it back into a "no, wait, I've had second thoughts" state, we can consider a two-step operation. This two-step operation will be all the more advantageous when the query that identifies rows for archiving is a slow-running one. What we may do in that case is:
A major justification for temporary tables is to enable massive, table-oriented operations that would outperform row-wise operations. |
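A sketch of such a two-step, set-oriented purge, with SQLite standing in for the DBMS and an invented invoices table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table invoices
                (invoice_id integer primary key, paid integer, invoice_date text)""")
conn.executemany("insert into invoices values (?, ?, ?)",
                 [(1, 1, "2004-05-01"),
                  (2, 0, "2004-06-01"),   # old but unpaid: must be kept
                  (3, 1, "2006-05-01"),
                  (4, 1, "2004-07-01")])

# Step 1: run the slow identification query once, materializing the keys
# of rows ready for archival into a temporary table.
conn.execute("""create temporary table to_archive as
                select invoice_id
                from invoices
                where paid = 1
                  and invoice_date < '2005-01-01'""")

# Step 2: archive, then delete, as fast table-oriented operations keyed
# on the small temporary table rather than re-running the slow search.
conn.execute("""delete from invoices
                where invoice_id in (select invoice_id from to_archive)""")

remaining = [r[0] for r in conn.execute(
    "select invoice_id from invoices order by invoice_id")]
print(remaining)
```

The expensive condition is evaluated once, the delete becomes a simple keyed operation, and the lock on the main table is held only for the second, fast step.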