8.4. Dynamically Defined Search CriteriaOne of the most common causes for awful visible performance (as opposed to the common dismal performance of batch programs, which can often be hidden for a while) is the use of dynamically defined search criteria. In practice, such criteria are a consequence of the dreaded requirement to "let the user enter the search criteria as well as the sort order via a screen interface." The usual symptoms displayed by this type of application is that many queries perform reasonably well, but that unfortunately from time to time a query that seems to be almost the same as a well-performing query happens to be very, very slow. And of course the problem is difficult to fix, since everything is so dynamic. Dynamic-search applications are often designed as a two-step drill-down query, as in Figure 8-2. Basically, a first screen is displayed to the user with a large choice of criteria and an array of possible conditions such as exclude or date between ... and .... These criteria are used to dynamically build a query that returns a list with some identifier and description, from which you can view all the associated details by selecting one particular item in the list. Figure 8-2. A typical multi-criteria searchWhen the same columns from the same tables are queried with varying search criteria, the key to success usually lays in a clever generation of SQL queries by the program that accesses the database. I am going to illustrate my point in detail with a very simple example, a movie database , and we shall only be concerned with returning a list of movie titles that satisfy a number of criteria. The environment used in this example is a widely popular combination, namely PHP and MySQL. Needless to say, the techniques shown in this chapter are in no way specific to PHP or to MySQLor to movie databases. 8.4.1. Designing a Simple Movie Database and the Main QueryOur central table will be something such as the following: Table MOVIES movie_id int(10) (auto-increment) movie_title varchar(50) movie_country char(2) movie_year year(4) movie_category int(10) movie_summary varchar(250) We certainly need a categories table (referenced by a foreign key on movie_category) to hold the different genres, such as Action, Drama, Comedy, Musical, and so forth. It can be argued that some movies sometimes span several categories, and a better design would involve an additional table representing a many-to-many relationship (meaning that one genre can be associated with several movies and that each movie can be associated with several genres as well), but for the sake of simplicity we shall admit that a single, main genre is enough for our needs in this example. Do we need one table for actors and another for directors? Creating two tables would be a design mistake, because it is quite common to see actors-turned-directors, and there is no need to duplicate personal information. From time to time one even finds a movie directed by one of the lead actors. We therefore need three more tables: people to store information such as name, first name, sex, year of birth, and so on; roles to define how people may contribute to a movie (actor, director, but also composer, director of photography, and the like); and movie_credits to state who was doing what in which movie. Figure 8-3 shows our complete movie schema. Figure 8-3. The movie database schemaLet's suppose now that we want to let people search movies in our database by specifying either: words from the title, the name of the director, or up to three names of any of the actors. Following is the source of our prototype page, which I have built in HTML to act as our screen display: <html> <head> <title>Movie Database</title> </head> <body> <CENTER> <HR> <BR> Please fill the form to query our database and click on <b>Search</b> when you are done... <BR> <HR> <BR> <form action="display_query.php" method="post"> <TABLE WIDTH="75%"> <TR> <TD>Movie Title :</TD> <TD><input type="text" name="title"></TD> </TR> <TR> <TD>Director :</TD> <TD><input type="text" name="director"></TD> </TR> <TR> <TD>Actor :</TD> <TD><input type="text" name="actor1"></TD> </TR> <TR> <TD>Actor :</TD> <TD><input type="text" name="actor2"></TD> </TR> <TR> <TD>Actor :</TD> <TD><input type="text" name="actor3"></TD> </TR> <TR> <TD COLSPAN="2" ALIGN="CENTER"> <HR> <input type="Submit" value="Search"> <HR> </TD> </TR> </TABLE> </form> </CENTER> </body> </html> This prototype page shows on screen as in Figure 8-4. First, let me make a few remarks:
Figure 8-4. The movie database search screenCertainly the thing not to do is to generate a query containing a criterion such as: and upper(<value entered for actor1>) = concat(upper(people_firstname), ' ', upper(people_name)) As shown in Chapter 3, the right part of the equality in such a criterion would prevent us from using any regular index we might have logically created on the name. Several products allow the creation of functional indexes and index the result of expressions, but the simplest and therefore best solution is probably as follows:
The first point simply means inserting upper(string) instead of string, which is easy enough. Keep the second point in mind for the time being: I'll come back to it in just a bit. If users were to fill all entry fields, all the time, then our resulting main query could be something such as: select movie_title, movie_year from movies inner join movie_credits mc1 on mc1.movie_id = movies.movie_id inner join people actor1 on mc1.people_id = actor1.people_id inner join roles actor_role on mc1.role_id = actor_role.role_id and mc2.role_id = actor_role.role_id and mc3.role_id = actor_role.role_id inner join movie_credits mc2 on mc2.movie_id = movies.movie_id inner join people actor2 on mc2.people_id = actor2.people_id inner join movie_credits mc3 on mc3.movie_id = movies.movie_id inner join people actor3 on mc3.people_id = actor3.people_id inner join movie_credits mc4 on mc4.movie_id = movies.movie_id inner join people director on mc4.people_id = director.people_id inner join roles director_role on mc4.role_id = director_role.role_id where actor_role.role_name = 'ACTOR' and director_role.role_name = 'DIRECTOR' and movies.movie_title like 'CHARULATA%' and actor1.people_firstname = 'SOUMITRA' and actor1.people_name = 'CHATTERJEE' and actor2.people_firstname = 'MADHABI' and actor2.people_name = 'MUKHERJEE' and actor3.people_firstname = 'SAILEN' and actor3.people_name = 'MUKHERJEE' and director.people_name = 'RAY' and director.people_firstname = 'SATYAJIT' Unfortunately, will somebody who can name the title, director and the three main actors of a film (most typically a movie buff) really need to use our database? This is very unlikely. The most likely search will probably be when a single field or possibly two, at most, will be populated. We must therefore anticipate blank fields, asking the question: what will we do when no value is passed? A common way of coding one's way out of a problematic situation like this is to keep the select list unchanged; then to join together all the tables that may intervene in one way or another, using suitable join conditions; and then to replace the straightforward conditions from the preceding example with a long series of: and column_name = coalesce(?, column_name) where ? will be associated with the value from an entry field, and coalesce( ) is the function that returns the first one of its arguments that is non null. If a value is provided, then a filter is applied; otherwise, all values in the column pass the test. All values? Not really; if a column contains a NULL, the condition for that column will evaluate to false. We cannot say that something we don't know is equal to something we don't know, even if it is the same something (nothing?). If one condition in our long series of conditions linked by and evaluates to false, the query will return nothing, which is certainly not what we want. There is a solution though, which is to write: and coalesce(column_name, constant) = coalesce(?, column_name, constant) This solution would be absolutely perfect if only it did not mean forfeiting the use of any index on column_name when a parameter is specified. Must we sacrifice the correctness of results to performance, or performance to the correctness of results? The latter solution is probably preferable, but unfortunately both of them might also mean sacrificing our job, a rather unpleasant prospect. A query that works in all cases, whatever happens, is quite difficult to write. The commonly adopted solution is to build such a query dynamically. What we can do in this example scenario is to store in a string everything up to the where and the fixed conditions on role names, and then to concatenate to this string the conditions which have been input by our program userand only those conditions. A variable number of search criteria calls for dynamically built queries. Assuming that a user searched our database for movies starring Amitabh Bachchan, the resulting, dynamically written query might be something like the following: select distinct movie_title, movie_year from movies inner join movie_credits mc1 on mc1.movie_id = movies.movie_id inner join people actor1 on mc1.people_id = actor1.people_id inner join roles actor_role on mc1.role_id = actor_role.role_id and mc2.role_id = actor_role.role_id and mc3.role_id = actor_role.role_id inner join movie_credits mc2 on mc2.movie_id = movies.movie_id inner join people actor2 on mc2.people_id = actor2.people_id inner join movie_credits mc3 on mc3.movie_id = movies.movie_id inner join people actor3 on mc3.people_id = actor3.people_id inner join movie_credits mc4 on mc4.movie_id = movies.movie_id inner join people director on mc4.people_id = director.people_id inner join roles director_role on mc4.role_id = director_role.role_id where actor_role.role_name = 'ACTOR' and director_role.role_name = 'DIRECTOR' and actor1.people_firstname = 'AMITABH' and actor1.people_name = 'BACHCHAN' order by movie_title, movie_year First, let me make two remarks:
Besides performance, there is also a very serious concern associated with dynamically built hardcoded queries, a security concern: such queries present a wide-open door to the technique known as SQL injection. What is SQL injection? Let's say that we run a commercial operation, and that only subscribers are allowed to query the full database while access to movies older than 1960 is free to everybody. Suppose that a malicious non-subscriber enters into the movie_title field something such as: X' or 1=1 or 'X' like 'X When we simply concatenate entry fields to our query text we shall end up with a condition such as: where movie_title like 'X' or 1=1 or 'X' like 'X%' and movie_year < 1960 which is always true and will obviously filter nothing at all! Concatenating the entry field to the SQL statement means that in practice anybody will be able to download our full database without any subscription. And of course some information is more sensitive than movie databases. Binding variables protects from SQL injection. SQL injection is a very real security matter for anyone running an on-line database, and great care should be taken to protect against its malicious use. When using dynamically built queries, use parameter markers and pass values as bind variables, for both performance and security (SQL injection) reasons. A query with prepared joins and dynamically concatenated filtering conditions executes very quickly when the tables are properly indexed. But there is nevertheless something that is worrisome. The preceding example query is a very complicated query, particularly when we consider the simplicity of both the output result and of what we provided as input. 8.4.2. Right-Sizing QueriesIn fact, the complexity of the query is just one part of the issue. What happens, in the case of the final query in the preceding section, if we have not recorded the name of the director in our database, or if we know only the names of the two lead actors? The query will return no rows. All right, can we not use outer joins then, which return matching values when there is one and NULL when there is none? Using outer joins might be a solution, except that we don't know what exactly will be queried. What if we only have the name of the director in our database? In fact, we would need outer joins everywhereand putting them everywhere is often, logically, impossible. We therefore have an interesting case, in which we are annoyed by missing information even if all of our attributes are defined as mandatory and we have absolutely no NULL values in the database, simply because our query so far assumes joins that may be impossible to satisfy. In fact, in the particular case when only one actor name is provided, we need a query no more complicated than the following: select movie_title, movie_year from movies inner join movie_credits mc1 on mc1.movie_id = movies.movie_id inner join people actor1 on mc1.people_id = actor1.people_id inner join roles actor_role on mc1.role_id = actor_role.role_id where actor_role.role_name = 'ACTOR' and actor1.people_firstname = 'AMITABH' and actor1.people_name = 'BACHCHAN' order by movie_title, movie_year This "tight-fit" query assumes nothing about our also knowing the name of the director, nor of a sufficient number of other actors, and hence there is no need for outer joins. Since we have already begun building our query dynamically, why not try to inject a little more intelligence in our building exercise, so as to obtain a query really built to order, exactly tailored to our needs? Our code will no doubt be more complicated. Is the complication worth it? The simple fact that we are now certain to return all the information available when given an actor's name, even when we don't know who directed a film, should be reason enough for an unqualified "yes." But performance reasons also justify taking this step. Nothing is as convincing as running a query in a loop a sufficient number of times to show the difference between two approaches: our "tight-fit" query is five times faster than the "one-size-fits-all" query. All other things aside, does it matter if our query executes in 0.001 second instead of 0.005 second? Not much, if our database is only queried now and then. But there may be a day when queries arrive at a rate higher than we can service and keep up with, and then we'll have a problem. Queries will have to be queued, and the queue length will increase very quicklyas fast as the number of complaints about poor database performance. Simply put, going five times faster enables five times as many queries to be processed on the same hardware. (We will consider these issues in more detail in Chapter 9.) Matching criteria with dynamically built queries improves performance by minimizing joins, and eliminates the issue of missing values. 8.4.3. Wrapping SQL in PHPLet's first start our PHP page with a smattering of regular HTML before the real PHP code: <html> <head> <title>Query result</title> </head> <body> <CENTER> <table width="80%"> <TR><TH>Title</TH><TH>Year</TH><TR> <?php ... (Our page would probably be nicer with a stylesheet....) Once we have our handle that represents the connection to the database, the very first thing to do is to get the values that were submitted to the entry screen. Since everything is stored in uppercase in our database we can convert the user-entered values directly to uppercase too. This is of course something that can be done in the SQL code, but it costs nothing to do it in the PHP code: $title=strtoupper($_POST['title']); $director=strtoupper($_POST['director']); $actor1=strtoupper($_POST['actor1']); $actor2=strtoupper($_POST['actor2']); $actor3=strtoupper($_POST['actor3']); We now have a technical problem linked to the implementation of PHP binding. Following is the process for binding variables in PHP:
All parameters are identified by position (the same is true with JDBC, but not with all database systems and languages; for instance, you will refer to bind variables by name in an SQLJ program). But our main problem is the single call to bind_param( ), which is very convenient when we know exactly how many parameters we have to bind, but is not so in our case here, in which we do not know in advance how many values a user will enter. It would be much more convenient in our case to have a method allowing us to loop and bind values one by one. One way to bind a variable number of values, which is not necessarily the most elegant, is to loop on all the variables we have received from the form, check which ones actually contain something, and store each value in the subsequent positions of an array. We have no problem doing this with our example since all the values we may get are character strings. If we were expecting something elsefor instance the year when a movie was first shownthe most sensible approach would probably be to treat such a value as a string inside the PHP code and to convert it to a number or date in the SQL code. We can use a $paramcnt variable to count how many parameters were provided by the user of the form, and store the values into a $params array: $paramcnt=0; if ($title != "") { $params[$paramcnt] = $title; $paramcnt++; } Things get a little more complicated with people names. Remember that we have decided that having a single field to enter a name was more user-friendly than having to enter the first and last names into two separate fields. However, comparing the string entered by the user to the concatenation of first name and last name in our people table would prevent the query from using the index on the last name and might, moreover, yield wrong results: if the user has mistakenly typed two spaces instead of one between first name and last name, for instance, we shall not find the person. What we are therefore going to do is to split the entry field into first name and last name, assuming that the last name is the last word, and that the first name, which may be composed of 0, 1, or several words, is what precedes the last name. In PHP, we can easily write such a function which sets two parameters that are passed by reference: function split_name($string, &$firstname, &$lastname) { /* * We assume that the last name is the last element of the string, * and that we may have several first names */ $pieces = explode(" ", $string); $parts = count($pieces); $firstnames = array_slice($pieces, 0, $parts - 1); $firstname = implode(" ", $firstnames); $lastname = $pieces[$parts - 1]; } This function will allow us to split $director into $dfn and $dn, $actor1 into $a1fn and $a1n and so on, everything being coded on the same model: if ($director != "") { /* Split firstname / name */ split_name($director, $dfn, $dln); if ($dfn != "") { $params[$paramcnt] = $dfn; $paramcnt++; } $params[$paramcnt] = $dln; $paramcnt++; } Once we have inspected our parameters, all we have to do is to build our query, being very careful to insert the parameter markers for the bind variables in exactly the same order as they will appear in the $params array: $query = "select movie_title, movie_year " ."from movies"; /* Director was specified ? */ if ($director != "") { $query = $query." inner join movie_credits mcd" ." on mcd.movie_id = movies.movie_id" ." inner join people director" ." on mcd.people_id = director.people_id" ." inner join roles director_role" ." on mcd.role_id = director_role.role_id"; } /* Any actor was specified ? */ if ($actor1.$actor2.$actor3 != "") { /* * First the join on the ROLES table */ $query = $query." inner join roles actor_role"; /* * Even if only one actor was specified, we may * not necessarily find the name in $actor1 so careful */ $actcnt = 0; if ($actor1 != "") { if ($actcnt == 0) { $query = $query." on"; } else { $query = $query." and"; } $query = $query." mc1.role_id = actor_role.role_id"; } if ($actor2 != "") { ... } if ($actor3 != "") { ... } /* * Then join on MOVIE_CREDITS and PEOPLE */ if ($actor1 != "") { $query = $query." inner join movie_credits mc1" ." on mc1.movie_id = movies.movie_id" ." inner join people actor1" ." on actor1.people_id = mc1.people_id"; } if ($actor2 != "") { ... } if ($actor3 != "") { ... } } /* * We are done with the FROM clause; we are using the old 1=1 * trick to avoid checking each time whether it is the very * first condition or not - the latter case requires an 'and'. */ $query = $query." where 1=1"; /* * Be VERY careful to add parameters in the same order they were * stored into the $params array */ if ($title != "") { $query = $query." and movies.movie_title like concat(?, '%')"; } /* Director was specified ? */ if ($director != "") { $query = $query." and director_role.role_name = 'DIRECTOR'"; if ($dfn != "") { /* * Use like instead of regular equality for the first name, it will * work with some abbreviations or initials. */ $query = $query ." and director.people_firstname like concat(?, '%')"; } $query = $query." and director.people_name = ?"; } if ($actor1.$actor2.$actor3 != "") { $query = $query." and actor_role.role_name = 'ACTOR'"; if ($actor1 != "") { ... } if ($actor2 != "") { ... } if ($actor3 != "") { ... } } Once our query is ready, we call the prepare( ) method, then bind our variables; this is where our code is not very pretty, since we can have between 1 and 9 variables to bind and handle, and each variable must be handled separately: /* create a prepared statement */ if ($stmt = $mysqli->prepare($query)) { /* * Bind parameters for markers * * This is the messiest part. * We can have anything between 1 and 9 parameters in all (all strings) */ switch ($paramcnt) { case 1 : $stmt->bind_param("s", $params[0]); break; case ... ... break; case 9 : $stmt->bind_param("sssssssss", $params[0], $params[1], $params[2], $params[3], $params[4], $params[5], $params[6], $params[7], $params[8]); break; default : break; } Et voilà! We are done and just have to execute the query and display the result: /* execute query */ $stmt->execute( ); /* fetch values */ $stmt->bind_result($mt, $my); while ($row = $stmt->fetch( )) { printf ("<tr><TD>%s</TD><TD>%d</TD></TR>\n", $mt, $my); } /* close statement */ $stmt->close( ); } else { printf("Error: %s\n", $mysqli->sqlstate); } ?> </TABLE> </CENTER> Obviously, the code here is significantly more complicated than if we had tried to have one single query. It may seem surprising, after I have advocated pushing as much work as possible onto the DBMS side, to now find me defending the use of complicated code to build as simple a SQL statement as possible. Doing as much work on the SQL side as possible makes sense when it is work that has to be performed. But joining three times as many tables as are needed in the average query, with some of these useless joins not necessarily being very efficient (especially when they happen to be against complex views) makes no sense at all. By intelligently building the query, we tightly control what is executed in terms of security, correctness of the result, and performance. Any simpler solution bears risks of sacrificing at least one of these aspects. To summarize, there are at least three mistakes that are very commonly made in queries that take a variable number of search criteria:
More intelligence in the dynamic construction of an SQL statement makes for a more efficient SQL statement. |