8.1. Deceiving Criteria

I already mentioned in Chapter 6 that some queries have a very selective combination of criteria that individually are not very selective, and I noted that this is a rather difficult situation in which to achieve good performance.

Another interesting case, and one in which we are not totally helpless, is a criterion that at first sight looks efficient, but that requires some attention to fulfill its potential. Credit card validation procedures provide a good example. As you may know, a credit card number encodes several pieces of information, including the card type, the issuer, and so on. By way of example, let's look at the problem of achieving a first level of control for payments made at a toll road in one of the most visited Western European countries. This means checking a very large number of credit cards, supplied by a large number of international issuers, each with its own method of encoding.

Credit card numbers can have up to 19 digits, although many issuers use fewer: MasterCard cards have 16 digits, Visa cards 16 or 13, and American Express cards 15, to mention just three well-known issuers. In all cases the first six digits indicate who the issuer is, and the last digit is a kind of checksum to spot any mistyping. A first, coarse level of control could be to check that the issuer is known and that the checksum is correct. However, the checksum algorithm is public knowledge (it can be found on the Internet) and can easily be faked. A more refined level of control also checks that the prefix of the card number belongs, for a given issuer, to a valid range of values, together with an additional control on the number of digits in the card. In our case, we are provided with a list of about 200,000 valid prefixes of varying lengths.
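As an aside, the public checksum scheme in question is the Luhn algorithm, the mod-10 scheme used for the final digit of card numbers. A minimal Python sketch of the check:

```python
def luhn_valid(card_number: str) -> bool:
    """Check the last digit of a card number with the Luhn algorithm."""
    digits = [int(d) for d in card_number]
    total = 0
    # Starting from the rightmost digit, double every second digit;
    # when doubling yields a two-digit number, add its digits
    # (equivalent to subtracting 9).
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    # The number is well formed when the weighted sum is a multiple of 10
    return total % 10 == 0

print(luhn_valid("4111111111111111"))   # True: a classic well-formed test number
print(luhn_valid("4111111111111112"))   # False: last digit altered
```

As the text notes, passing this check proves nothing about the card being genuine; it only catches mistyping, which is why the prefix control that follows is needed.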

How do we write the query to test a given card number against the valid ranges of values for the card's issuer? The following is easy enough:

     select count(*)
     from credit_card_check
     where ? like prefix + '%'

Here, ? stands for the card number to check, and + denotes string concatenation, often written || or concat( ) depending on the product. It seems we just have to index the prefix column; and yet, even with that index, we will get a full table scan each time.

Why is a full table scan happening? Haven't we seen that an index was usable when we were addressing only the leftmost part of the key? True enough, but saying that the value we want to check is the leftmost part of the full key is not the same as saying, as here, that the full key is the leftmost part of the value we want to check. The difference may seem subtle, but the two cases are mirror images of each other.

Suppose that the credit card number to verify is 4000 0012 3456 7899.[*] Now imagine that our credit_card_check table holds values such as 312345, 3456, and 40001. We can see those three values as prefixes and, more or less implicitly, we see them as being in sorted order. Indeed, they are in ascending order if they are stored as strings of characters, but not if they are stored as numbers. But there is yet more to worry about.

[*] An invalid card number, in case you were wondering... .
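The asymmetry between the two directions can be demonstrated outside SQL, with a sorted Python list standing in for the index. This is only a sketch; the extra prefix 400000 is a hypothetical value added to the chapter's three samples so that the sample card number actually has a match:

```python
from bisect import bisect_left, bisect_right

# The chapter's three sample values, plus a hypothetical 400000
prefixes = sorted(["312345", "3456", "40001", "400000"])
card = "4000001234567899"

# Index-friendly direction: find the stored keys that START WITH a given
# constant, say '40'.  They form one contiguous slice of the sorted list,
# which is exactly what an index range scan retrieves for  prefix like '40%'.
lo = bisect_left(prefixes, "40")
hi = bisect_right(prefixes, "40" + "\uffff")   # just past any key starting '40'
print(prefixes[lo:hi])          # ['400000', '40001']

# Our problem, the mirror image: find the stored key that IS a prefix of
# the card number.  The matching key may be of any length, so no single
# comparison tells us which part of the sorted list to discard; short of
# a smarter scheme, every key must be tested -- a full scan.
matches = [p for p in prefixes if card.startswith(p)]
print(matches)                  # ['400000']
```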

When we descend a tree (our index), we have a value to compare to the keys stored in the tree. If the value is equal to the key stored in the current node, we are done. Otherwise, we have to search a subtree that depends on whether our value is smaller or greater than that key. If the prefixes had a fixed length, we would have no difficulty: we would only need to take the suitable number of digits from our card number and compare them to the prefixes stored in the index. But when the length of the prefix varies, which is our case, we must compare a different number of characters each time. This is not a task that a regular SQL index search knows how to perform.

Is there any way out? Fortunately, there is one. An operator such as like actually selects a range of values. If we want to check, say, that a 16-digit Visa card number is like 4000%, it actually means that we expect to find it between 4000000000000000 and 4000999999999999. If we had a composite index on these lower and upper boundary numbers, then we could very easily check the card number against the index. That works, though, only if all card numbers have 16 digits. Fortunately, a varying number of digits is a problem that is easy to solve. All card numbers have at most 19 digits. If we right-pad our Visa card number with three more 0s, thus bringing its total number of digits to 19, we can just as validly check whether 4000001234567899000 is between 4000000000000000000 and 4000999999999999999.
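In Python terms, the padding logic looks like the following sketch (illustrative only, not any particular product's rpad( )):

```python
MAX_LEN = 19   # the maximum length of a card number

def bounds(prefix: str) -> tuple[str, str]:
    # Right-pad with '0' for the lower bound, with '9' for the upper bound
    return prefix.ljust(MAX_LEN, "0"), prefix.ljust(MAX_LEN, "9")

lower, upper = bounds("4000")
# Pad the card number the same way before comparing
card = "4000001234567899".ljust(MAX_LEN, "0")
print(lower <= card <= upper)   # True: the card falls in the 4000... range
```

Because all three strings now have the same length, a plain string comparison gives the same ordering as a numeric one, and the check becomes an ordinary closed range condition.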

Instead of storing prefixes, we need to have two columns: lower_bound and upper_bound. The first one, the lower_bound, is obtained by right-padding our prefix to the maximum length of 19 with 0s, and upper_bound is obtained by right-padding with 9s. Granted, this is denormalization of a sort. However, this is a real read-only reference table, which makes our sin slightly more forgivable. We just have to index (lower_bound, upper_bound) and write our condition as the following to see our query fly:

     where substring(? + '0000000000000000000', 1, 19)
               between lower_bound and upper_bound

Many products directly implement an rpad( ) function for such right-padding. When we have a variable-length prefix to check, the solution is to get back to a common access case: the index range scan.
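Putting the pieces together, here is a minimal sketch using SQLite and a handful of illustrative prefixes (the real table would hold the 200,000 entries mentioned earlier; table and index names are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table credit_card_check (
        lower_bound char(19) not null,
        upper_bound char(19) not null
    )
""")
conn.execute(
    "create index ccc_bounds on credit_card_check (lower_bound, upper_bound)"
)

# Load a few illustrative prefixes, padded to 19 digits on insertion
for prefix in ("4000", "312345", "3456"):
    conn.execute(
        "insert into credit_card_check values (?, ?)",
        (prefix.ljust(19, "0"), prefix.ljust(19, "9")),
    )

def is_known_prefix(card_number: str) -> bool:
    # Right-pad the card number to 19 digits, then let the composite
    # index resolve the range condition
    padded = card_number.ljust(19, "0")
    (n,) = conn.execute(
        """select count(*)
           from credit_card_check
           where ? between lower_bound and upper_bound""",
        (padded,),
    ).fetchone()
    return n > 0

print(is_known_prefix("4000001234567899"))   # True: inside the 4000... range
print(is_known_prefix("5100001234567899"))   # False: no matching range
```

The padding is done in the client here for brevity; in production it belongs in the query (via rpad( ) or the substring trick shown above) so that the table stores nothing but the precomputed bounds.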

Try to express unusual conditions, such as comparisons on a prefix or part of a key, as range conditions; whenever possible, ensure that there is both a lower and an upper bound.




The Art of SQL
ISBN: 0596008945
Pages: 143
