Section 8.4. Dynamically Defined Search Criteria

8.4. Dynamically Defined Search Criteria

One of the most common causes for awful visible performance (as opposed to the common dismal performance of batch programs, which can often be hidden for a while) is the use of dynamically defined search criteria. In practice, such criteria are a consequence of the dreaded requirement to "let the user enter the search criteria as well as the sort order via a screen interface."

The usual symptoms displayed by this type of application is that many queries perform reasonably well, but that unfortunately from time to time a query that seems to be almost the same as a well-performing query happens to be very, very slow. And of course the problem is difficult to fix, since everything is so dynamic.

Dynamic-search applications are often designed as a two-step drill-down query, as in Figure 8-2. Basically, a first screen is displayed to the user with a large choice of criteria and an array of possible conditions such as exclude or date between ... and .... These criteria are used to dynamically build a query that returns a list with some identifier and description, from which you can view all the associated details by selecting one particular item in the list.

Figure 8-2. A typical multi-criteria search

When the same columns from the same tables are queried with varying search criteria, the key to success usually lays in a clever generation of SQL queries by the program that accesses the database. I am going to illustrate my point in detail with a very simple example, a movie database , and we shall only be concerned with returning a list of movie titles that satisfy a number of criteria. The environment used in this example is a widely popular combination, namely PHP and MySQL. Needless to say, the techniques shown in this chapter are in no way specific to PHP or to MySQLor to movie databases.

8.4.1. Designing a Simple Movie Database and the Main Query

Our central table will be something such as the following:

     Table MOVIES        movie_id       int(10) (auto-increment)        movie_title    varchar(50)        movie_country  char(2)        movie_year     year(4)        movie_category int(10)        movie_summary  varchar(250)

We certainly need a categories table (referenced by a foreign key on movie_category) to hold the different genres, such as Action, Drama, Comedy, Musical, and so forth. It can be argued that some movies sometimes span several categories, and a better design would involve an additional table representing a many-to-many relationship (meaning that one genre can be associated with several movies and that each movie can be associated with several genres as well), but for the sake of simplicity we shall admit that a single, main genre is enough for our needs in this example.

Do we need one table for actors and another for directors? Creating two tables would be a design mistake, because it is quite common to see actors-turned-directors, and there is no need to duplicate personal information. From time to time one even finds a movie directed by one of the lead actors.

We therefore need three more tables: people to store information such as name, first name, sex, year of birth, and so on; roles to define how people may contribute to a movie (actor, director, but also composer, director of photography, and the like); and movie_credits to state who was doing what in which movie. Figure 8-3 shows our complete movie schema.

Figure 8-3. The movie database schema

Let's suppose now that we want to let people search movies in our database by specifying either: words from the title, the name of the director, or up to three names of any of the actors. Following is the source of our prototype page, which I have built in HTML to act as our screen display:

     <html>     <head>       <title>Movie Database</title>     </head>     <body>     <CENTER>         <HR>         <BR>         Please fill the form to query our database and click on <b>Search</b> when you are done...         <BR>         <HR>         <BR>     <form action="display_query.php" method="post">         <TABLE WIDTH="75%">         <TR>           <TD>Movie Title :</TD>           <TD><input type="text" name="title"></TD>         </TR>         <TR>           <TD>Director    :</TD>           <TD><input type="text" name="director"></TD>         </TR>         <TR>           <TD>Actor       :</TD>           <TD><input type="text" name="actor1"></TD>         </TR>         <TR>           <TD>Actor       :</TD>           <TD><input type="text" name="actor2"></TD>         </TR>         <TR>           <TD>Actor       :</TD>           <TD><input type="text" name="actor3"></TD>         </TR>         <TR>           <TD COLSPAN="2" ALIGN="CENTER">           <HR>           <input type="Submit" value="Search">           <HR>           </TD>         </TR>         </TABLE>     </form>     </CENTER>     </body>     </html>

This prototype page shows on screen as in Figure 8-4.

First, let me make a few remarks:

Although we want to store the first and last names separately in our database (definitely more convenient if we want to generate a listing ordered by last name), we don't want our entry form to look like a passport renewal form: we just want a single entry field for each individual.
We want our query input values to be case-insensitive.

Figure 8-4. The movie database search screen

Certainly the thing not to do is to generate a query containing a criterion such as:

     and upper(<value entered for actor1>) =               concat(upper(people_firstname), ' ', upper(people_name))

As shown in Chapter 3, the right part of the equality in such a criterion would prevent us from using any regular index we might have logically created on the name. Several products allow the creation of functional indexes and index the result of expressions, but the simplest and therefore best solution is probably as follows:

Systematically store in uppercase any character column that is likely to be queried (we can always write a function to beautify it before output).
Split the entry field into first name and (last) name before passing it to the query.

The first point simply means inserting upper(string) instead of string, which is easy enough. Keep the second point in mind for the time being: I'll come back to it in just a bit.

If users were to fill all entry fields, all the time, then our resulting main query could be something such as:

     select movie_title, movie_year     from movies          inner join movie_credits mc1            on mc1.movie_id = movies.movie_id          inner join people actor1            on  mc1.people_id = actor1.people_id          inner join roles actor_role            on  mc1.role_id = actor_role.role_id            and mc2.role_id = actor_role.role_id            and mc3.role_id = actor_role.role_id          inner join movie_credits mc2            on mc2.movie_id = movies.movie_id          inner join people actor2            on  mc2.people_id = actor2.people_id          inner join movie_credits mc3            on mc3.movie_id = movies.movie_id          inner join people actor3            on  mc3.people_id = actor3.people_id          inner join movie_credits mc4            on mc4.movie_id = movies.movie_id          inner join people director            on  mc4.people_id = director.people_id          inner join roles director_role            on mc4.role_id = director_role.role_id     where actor_role.role_name = 'ACTOR'       and director_role.role_name = 'DIRECTOR'       and movies.movie_title like 'CHARULATA%'       and actor1.people_firstname = 'SOUMITRA'       and actor1.people_name = 'CHATTERJEE'       and actor2.people_firstname = 'MADHABI'       and actor2.people_name = 'MUKHERJEE'       and actor3.people_firstname = 'SAILEN'       and actor3.people_name = 'MUKHERJEE'       and director.people_name = 'RAY'       and director.people_firstname = 'SATYAJIT'

Unfortunately, will somebody who can name the title, director and the three main actors of a film (most typically a movie buff) really need to use our database? This is very unlikely. The most likely search will probably be when a single field or possibly two, at most, will be populated. We must therefore anticipate blank fields, asking the question: what will we do when no value is passed?

A common way of coding one's way out of a problematic situation like this is to keep the select list unchanged; then to join together all the tables that may intervene in one way or another, using suitable join conditions; and then to replace the straightforward conditions from the preceding example with a long series of:

      and column_name = coalesce(?, column_name)

where ? will be associated with the value from an entry field, and coalesce( ) is the function that returns the first one of its arguments that is non null. If a value is provided, then a filter is applied; otherwise, all values in the column pass the test.

All values? Not really; if a column contains a NULL, the condition for that column will evaluate to false. We cannot say that something we don't know is equal to something we don't know, even if it is the same something (nothing?). If one condition in our long series of conditions linked by and evaluates to false, the query will return nothing, which is certainly not what we want. There is a solution though, which is to write:

     and coalesce(column_name, constant) = coalesce(?, column_name, constant)

This solution would be absolutely perfect if only it did not mean forfeiting the use of any index on column_name when a parameter is specified. Must we sacrifice the correctness of results to performance, or performance to the correctness of results? The latter solution is probably preferable, but unfortunately both of them might also mean sacrificing our job, a rather unpleasant prospect.

A query that works in all cases, whatever happens, is quite difficult to write. The commonly adopted solution is to build such a query dynamically. What we can do in this example scenario is to store in a string everything up to the where and the fixed conditions on role names, and then to concatenate to this string the conditions which have been input by our program userand only those conditions.

A variable number of search criteria calls for dynamically built queries.

Assuming that a user searched our database for movies starring Amitabh Bachchan, the resulting, dynamically written query might be something like the following:

     select distinct movie_title, movie_year     from movies          inner join movie_credits mc1            on mc1.movie_id = movies.movie_id          inner join people actor1            on  mc1.people_id = actor1.people_id          inner join roles actor_role            on  mc1.role_id = actor_role.role_id            and mc2.role_id = actor_role.role_id            and mc3.role_id = actor_role.role_id          inner join movie_credits mc2            on mc2.movie_id = movies.movie_id          inner join people actor2            on  mc2.people_id = actor2.people_id          inner join movie_credits mc3            on mc3.movie_id = movies.movie_id          inner join people actor3            on  mc3.people_id = actor3.people_id          inner join movie_credits mc4            on mc4.movie_id = movies.movie_id          inner join people director            on  mc4.people_id = director.people_id          inner join roles director_role            on mc4.role_id = director_role.role_id     where actor_role.role_name = 'ACTOR'       and director_role.role_name = 'DIRECTOR'       and actor1.people_firstname = 'AMITABH'       and actor1.people_name = 'BACHCHAN'     order by movie_title, movie_year

First, let me make two remarks:

We have to make our select a select distinct. We do this because we keep the joins without any additional condition. Otherwise, as many rows would be returned for each movie as we have actors and directors recorded for the movie.
It is very tempting when building the query to concatenate the values that we receive to the SQL text under construction proper, thus obtaining a query exactly as above. This is not, in fact, what we should do. I have already mentioned the subject of bind variables; it is now time to explain how they work. The proper course is indeed to build the query with placeholders such as ? (it depends on the language), and then to call a special function to bind the actual values to the placeholders. It may seem more work for the developer, but in fact it will mean less work for the DBMS engine. Even if we rebuild the query each time, the DBMS usually caches the statements it executes as a part of its standard optimization routines. If the SQL engine is given a query to process that it finds in its cache, the DBMS has already parsed the SQL text and the optimizer has already determined the best execution path. If we use placeholders, all queries that are built on the same pattern (such as searches for movies starring one particular actor) will use the same SQL text, irrespective of the actor's name. All the setup is done, the query can be run immediately, and the end user gets the response faster.

Besides performance, there is also a very serious concern associated with dynamically built hardcoded queries, a security concern: such queries present a wide-open door to the technique known as SQL injection. What is SQL injection? Let's say that we run a commercial operation, and that only subscribers are allowed to query the full database while access to movies older than 1960 is free to everybody. Suppose that a malicious non-subscriber enters into the movie_title field something such as:

     X' or 1=1 or 'X' like 'X

When we simply concatenate entry fields to our query text we shall end up with a condition such as:

     where movie_title like 'X' or 1=1 or 'X' like 'X%'       and movie_year < 1960

which is always true and will obviously filter nothing at all! Concatenating the entry field to the SQL statement means that in practice anybody will be able to download our full database without any subscription. And of course some information is more sensitive than movie databases. Binding variables protects from SQL injection. SQL injection is a very real security matter for anyone running an on-line database, and great care should be taken to protect against its malicious use.

When using dynamically built queries, use parameter markers and pass values as bind variables, for both performance and security (SQL injection) reasons.

A query with prepared joins and dynamically concatenated filtering conditions executes very quickly when the tables are properly indexed. But there is nevertheless something that is worrisome. The preceding example query is a very complicated query, particularly when we consider the simplicity of both the output result and of what we provided as input.

8.4.2. Right-Sizing Queries

In fact, the complexity of the query is just one part of the issue. What happens, in the case of the final query in the preceding section, if we have not recorded the name of the director in our database, or if we know only the names of the two lead actors? The query will return no rows. All right, can we not use outer joins then, which return matching values when there is one and NULL when there is none?

Using outer joins might be a solution, except that we don't know what exactly will be queried. What if we only have the name of the director in our database? In fact, we would need outer joins everywhereand putting them everywhere is often, logically, impossible. We therefore have an interesting case, in which we are annoyed by missing information even if all of our attributes are defined as mandatory and we have absolutely no NULL values in the database, simply because our query so far assumes joins that may be impossible to satisfy.

In fact, in the particular case when only one actor name is provided, we need a query no more complicated than the following:

     select movie_title, movie_year     from movies          inner join movie_credits mc1            on mc1.movie_id = movies.movie_id          inner join people actor1            on  mc1.people_id = actor1.people_id          inner join roles actor_role            on  mc1.role_id = actor_role.role_id     where actor_role.role_name = 'ACTOR'       and actor1.people_firstname = 'AMITABH'       and actor1.people_name = 'BACHCHAN'     order by movie_title, movie_year

This "tight-fit" query assumes nothing about our also knowing the name of the director, nor of a sufficient number of other actors, and hence there is no need for outer joins. Since we have already begun building our query dynamically, why not try to inject a little more intelligence in our building exercise, so as to obtain a query really built to order, exactly tailored to our needs? Our code will no doubt be more complicated. Is the complication worth it? The simple fact that we are now certain to return all the information available when given an actor's name, even when we don't know who directed a film, should be reason enough for an unqualified "yes." But performance reasons also justify taking this step.

Nothing is as convincing as running a query in a loop a sufficient number of times to show the difference between two approaches: our "tight-fit" query is five times faster than the "one-size-fits-all" query. All other things aside, does it matter if our query executes in 0.001 second instead of 0.005 second? Not much, if our database is only queried now and then. But there may be a day when queries arrive at a rate higher than we can service and keep up with, and then we'll have a problem. Queries will have to be queued, and the queue length will increase very quicklyas fast as the number of complaints about poor database performance. Simply put, going five times faster enables five times as many queries to be processed on the same hardware. (We will consider these issues in more detail in Chapter 9.)

Matching criteria with dynamically built queries improves performance by minimizing joins, and eliminates the issue of missing values.

8.4.3. Wrapping SQL in PHP

Let's first start our PHP page with a smattering of regular HTML before the real PHP code:

 <html>       <head>          <title>Query result</title>     </head>     <body>     <CENTER>       <table width="80%">         <TR><TH>Title</TH><TH>Year</TH><TR>                 <?php      ...

(Our page would probably be nicer with a stylesheet....)

Once we have our handle that represents the connection to the database, the very first thing to do is to get the values that were submitted to the entry screen. Since everything is stored in uppercase in our database we can convert the user-entered values directly to uppercase too. This is of course something that can be done in the SQL code, but it costs nothing to do it in the PHP code:

       $title=strtoupper($_POST['title']);       $director=strtoupper($_POST['director']);       $actor1=strtoupper($_POST['actor1']);       $actor2=strtoupper($_POST['actor2']);       $actor3=strtoupper($_POST['actor3']);

We now have a technical problem linked to the implementation of PHP binding. Following is the process for binding variables in PHP:

We first write ? in the place of every parameter we want to pass to the query.
Then we call the bind_param( ) method that takes as its first argument a string containing as many characters as we have values to bind, each character telling the type of the parameter we pass (in this case it will always be s for string), then a variable number of parametersone per each value we want to bind.

All parameters are identified by position (the same is true with JDBC, but not with all database systems and languages; for instance, you will refer to bind variables by name in an SQLJ program). But our main problem is the single call to bind_param( ), which is very convenient when we know exactly how many parameters we have to bind, but is not so in our case here, in which we do not know in advance how many values a user will enter. It would be much more convenient in our case to have a method allowing us to loop and bind values one by one.

One way to bind a variable number of values, which is not necessarily the most elegant, is to loop on all the variables we have received from the form, check which ones actually contain something, and store each value in the subsequent positions of an array. We have no problem doing this with our example since all the values we may get are character strings. If we were expecting something elsefor instance the year when a movie was first shownthe most sensible approach would probably be to treat such a value as a string inside the PHP code and to convert it to a number or date in the SQL code.

We can use a $paramcnt variable to count how many parameters were provided by the user of the form, and store the values into a $params array:

       $paramcnt=0;             if ($title != "") {          $params[$paramcnt] = $title;          $paramcnt++;         }

Things get a little more complicated with people names. Remember that we have decided that having a single field to enter a name was more user-friendly than having to enter the first and last names into two separate fields. However, comparing the string entered by the user to the concatenation of first name and last name in our people table would prevent the query from using the index on the last name and might, moreover, yield wrong results: if the user has mistakenly typed two spaces instead of one between first name and last name, for instance, we shall not find the person.

What we are therefore going to do is to split the entry field into first name and last name, assuming that the last name is the last word, and that the first name, which may be composed of 0, 1, or several words, is what precedes the last name. In PHP, we can easily write such a function which sets two parameters that are passed by reference:

       function split_name($string, &$firstname, &$lastname)       {       /*        *   We assume that the last name is the last element of the string,        *   and that we may have several first names        */        $pieces = explode(" ", $string);        $parts = count($pieces);        $firstnames = array_slice($pieces, 0, $parts - 1);        $firstname = implode(" ", $firstnames);        $lastname = $pieces[$parts - 1];       }

This function will allow us to split $director into $dfn and $dn, $actor1 into $a1fn and $a1n and so on, everything being coded on the same model:

       if ($director != "") {          /* Split firstname / name */              split_name($director, $dfn, $dln);          if ($dfn != "")             {              $params[$paramcnt] = $dfn;              $paramcnt++;             }          $params[$paramcnt] = $dln;          $paramcnt++;         }

Once we have inspected our parameters, all we have to do is to build our query, being very careful to insert the parameter markers for the bind variables in exactly the same order as they will appear in the $params array:

       $query = "select movie_title, movie_year "               ."from movies";       /* Director was specified ? */       if ($director != "")          {           $query = $query." inner join movie_credits mcd"                          ."  on mcd.movie_id = movies.movie_id"                          ." inner join people director"                          ."  on  mcd.people_id = director.people_id"                          ." inner join roles director_role"                          ."  on mcd.role_id = director_role.role_id";          }       /* Any actor was specified ? */       if ($actor1.$actor2.$actor3 != "")          {          /*           *   First the join on the ROLES table           */           $query = $query." inner join roles actor_role";          /*           *  Even if only one actor was specified, we may           *  not necessarily find the name in $actor1 so careful           */           $actcnt = 0;           if ($actor1 != "")              {               if ($actcnt == 0)                  {                   $query = $query."  on";                  }                else                  {                   $query = $query."  and";                  }               $query = $query." mc1.role_id = actor_role.role_id";              }           if ($actor2 != "")              {               ...              }           if ($actor3 != "")              {               ...              }          /*           *   Then join on MOVIE_CREDITS and PEOPLE           */           if ($actor1 != "")              {               $query = $query." inner join movie_credits mc1"                              ."  on mc1.movie_id = movies.movie_id"                              ." inner join people actor1"                              ."  on actor1.people_id = mc1.people_id";              }           if ($actor2 != "")              {               ...              }           if ($actor3 != "")              {               ...              }          }       /*        *   We are done with the FROM clause; we are using the old 1=1        *   trick to avoid checking each time whether it is the very        *   first condition or not - the latter case requires an 'and'.        */       $query = $query." where 1=1";       /*        * Be VERY careful to add parameters in the same order they were        * stored into the $params array        */       if ($title != "")          {           $query = $query." and movies.movie_title like concat(?, '%')";          }       /* Director was specified ? */       if ($director != "")          {           $query = $query."  and director_role.role_name = 'DIRECTOR'";           if ($dfn != "")              {              /*               * Use like instead of regular equality for the first name, it will               * work with some abbreviations or initials.               */               $query = $query                      ." and director.people_firstname like concat(?, '%')";              }           $query = $query." and director.people_name = ?";          }       if ($actor1.$actor2.$actor3 != "")          {           $query = $query."  and actor_role.role_name = 'ACTOR'";           if ($actor1 != "")              {               ...              }           if ($actor2 != "")              {               ...              }           if ($actor3 != "")              {               ...              }          }

Once our query is ready, we call the prepare( ) method, then bind our variables; this is where our code is not very pretty, since we can have between 1 and 9 variables to bind and handle, and each variable must be handled separately:

       /* create a prepared statement */       if ($stmt = $mysqli->prepare($query)) {          /*           *  Bind parameters for markers           *           *  This is the messiest part.           *  We can have anything between 1 and 9 parameters in all (all strings)           */           switch ($paramcnt)              {               case 1 :                    $stmt->bind_param("s", $params[0]);                    break;               case ...                    ...                    break;               case 9 :                    $stmt->bind_param("sssssssss", $params[0],                                                   $params[1],                                                   $params[2],                                                   $params[3],                                                   $params[4],                                                   $params[5],                                                   $params[6],                                                   $params[7],                                                   $params[8]);                    break;               default :                    break;              }

Et voilà! We are done and just have to execute the query and display the result:

      /* execute query */          $stmt->execute(  );          /* fetch values */          $stmt->bind_result($mt, $my);          while ($row = $stmt->fetch(  ))             {               printf ("<tr><TD>%s</TD><TD>%d</TD></TR>\n", $mt, $my);             }          /* close statement */          $stmt->close(  );         }       else         {          printf("Error: %s\n", $mysqli->sqlstate);         }     ?>     </TABLE>     </CENTER>

Obviously, the code here is significantly more complicated than if we had tried to have one single query.

It may seem surprising, after I have advocated pushing as much work as possible onto the DBMS side, to now find me defending the use of complicated code to build as simple a SQL statement as possible. Doing as much work on the SQL side as possible makes sense when it is work that has to be performed. But joining three times as many tables as are needed in the average query, with some of these useless joins not necessarily being very efficient (especially when they happen to be against complex views) makes no sense at all.

By intelligently building the query, we tightly control what is executed in terms of security, correctness of the result, and performance. Any simpler solution bears risks of sacrificing at least one of these aspects.

To summarize, there are at least three mistakes that are very commonly made in queries that take a variable number of search criteria:

First of all, it is quite common to see the values against which the columns of the tables are compared being concatenated with the statement-in-making, thus resulting in a magnificent, totally hardcoded statement. Even where queries are supposed to be absolutely unpredictable, you usually find a few queries that are issued again and again by the users, with only the constants varying. Some constants are susceptible to a high degree of variability (such as entity identifiers, as opposed to date formats or even status codes). It isn't much work to replace these constants by a parameter marker, the syntax of which depends on the language (for instance '?') and then to bind the actual value to this parameter marker. This will result in much less work for the server, which will not need to re-analyze the statement each time it is issued, and in particular will not need to determine each time a best execution plan, that will always be the same. And no user will be able to bypass any additional restriction you may want to add to the query, which means that by binding variables you will plug a serious security issue at the same time.
A second mistake is usually to try to include in the query everything that may matter. It is not because a search criterion may refer to data stored in one table that this table must appear in the from clause. I have already alluded to this issue in the previous chapters, but the from clause should only contain the tables from which we return data, as well as the tables enabling us to join them together. As we have seen in Chapter 6, existence tests should be solved by subquerieswhich are no more difficult to generate dynamically than a regular condition in a where clause.
The most important mistake is the one-size-fits-all philosophy. Behind every generic query are usually hidden three or four families of queries. Typically, input data is made up of identifiers, status values, or some ranges of dates. The input values may be strong, efficient criteria, or weak ones, or indeed anything in between (sometimes an additional criterion may reinforce a weak one by narrowing the scope). From here, trying to build several alternate queries in an intelligent fashion, as in the various cases of Chapter 6, is the only sound way out, even if it looks more complicated.

More intelligence in the dynamic construction of an SQL statement makes for a more efficient SQL statement.