Keys and Indexes


In the preceding chapter you met the CREATE INDEX statement, used to add an index to one or more columns of a database table.

Indexes are used to preserve a particular sort order on a column within the database itself, enabling queries in which the WHERE clause references that column directly to operate much faster. Rather than scanning a table's data from top to bottom and pulling out just the rows that match the given criteria, the index tells SQLite exactly where to look in that table to find the matching rows.

What an Index Does

A database index is not an abstract concept; you certainly have come across one before. Let's consider the index in this book, for instance, and think about when you would use it.

If you were trying to find references to the API call sqlite_exec(), for example, you might already expect to find this in Chapter 6, "The C/C++ Interface"but bear in mind that a computer program cannot think for itself. So for a moment pretend you don't have the benefit of this knowledge.

To make sure you find every reference in the book, the systematic and thorough approach is to start from page one, line one and read each page a line at a time until the end of the book, noting each reference as you come across it. Clearly, this would be a time-consuming process.

Fortunately the production team behind this book has created a comprehensive index, with topics and terms used throughout the book listed alphabetically and page references given for each entry. You've seen an index before and know how to use one, so you already know that it's much quicker to look up sqlite_exec() in the index and jump to each page in turn to find the topic you are looking for than to scan every single printed word.

How Indexes Work in SQLite

A database index works much the same way as the index in a book. Suppose you have a table containing a list of names and telephone numbers, and the data has been built up over time as you come into contact with new people. It's very likely that some of your acquaintances will share a surname, but certainly your records were not added to the table in alphabetical order.

Imagine you want to find the details for someone called Brown. The SQL SELECT statement might look like this:

 SELECT * FROM contacts WHERE last_name = 'Brown'; 

Without an index, the database has to look at each record in the database in turn and compare it to the string Brown. The more popular you are, the larger your contacts database will be and the longer this query will take to complete. This process is known as a full table scan.

However, having an index on the last_name column means that SQLite will know how to sort the records in that table alphabetically by surname, and consequently it can pick out the matching values of last_name without even looking at the other rows in the table.

Such an index could be created on the contacts table as follows:

 CREATE INDEX contacts_last_name ON contacts(last_name); 

Remember that each index has to be given its own unique identifier, and that an index name cannot share the same name as a table, view, or trigger in the same database.

Note

Although the CREATE INDEX syntax allows you to specify a sort orderASC or DESCall indexes are currently created in ascending order. This feature is reserved for possible future use.


Indexing Multiple Columns

An index can be created on more than one column at a timeknown as a clustered index. The order in which the columns are specified in the CREATE INDEX statement determines the sort order for the indexrecords are sorted on the first column initially, then on the second column where values of the first column are the same.

Think of how a telephone directory is organized. People appear in order of their surname, but because there are many people with the same surname, the listings are further ordered by first name or initials.

In SQLite, this kind of index for the contacts table would look like this:

 CREATE INDEX contacts_name ON contacts(last_name, first_name); 

A WHERE clause with conditions on both columns would find results very quickly using this index. In fact, because the primary sort order is last_name, a query that only references last_name would also perform well using this index.

The ordering of the columns for a clustered index is important. A condition imposed on first_name only would be no better off using this index than it would doing a full table scan.

Unique Indexes

You can specify the optional keyword UNIQUE in the CREATE INDEX statement to indicate that each value in the indexed columnor each permutation of values in a clustered indexonly appears in the table once.

Using CREATE UNIQUE INDEX is the same as imposing a UNIQUE constraint in the table schema, but it can be done at any time after the table has been created. If you attempt to insert a duplicate value, SQLite will return an error and the insert will fail.

In fact, using the UNIQUE attribute in the CREATE TABLE statement causes SQLite to create a unique index automatically on this column as shown in the following example. Note that the sql column is empty because no CREATE INDEX command was entered.

 sqlite> create table mytable (    ...>   a INTEGER PRIMARY KEY    ...> ); sqlite> select * from sqlite_master where type = 'index'; type        name                   tbl_name    rootpage    sql ----------  ---------------------  ----------  ----------  ---------- index       (mytable autoindex 1)  mytable     3 

Note

If a column is declared as INTEGER PRIMARY KEY, it is used internally as the key for the table structure, so the records in the table are ordered on that column by default. Although it does not cause an index to show up in sqlite_master, the column will behave as if it were indexed. Explicitly creating an index on an INTEGER PRIMARY KEY column uses additional database resources without providing any extra benefits.


When to Create an Index

An index can be created or dropped at any time without affecting the data contained in that table. A number of factors determine when it becomes beneficial to create an index.

Usually you will create indexes on columns used in conditions in WHERE statements. Whether the purpose of the index is to restrict a dataset from one table or to perform a join, generally speaking SQLite will be able to do its work much faster if key fields are indexed.

However, very small tables are often not made any faster by adding an index and in some cases, the index can actually slow things down. It can be just as fast for SQLite to do a quick scan of all rows than to look up the sort order from the index.

If a table has data inserted frequently but not queried very often, use indexes sparingly. Although indexes can speed up queriesand UPDATE statements with a WHERE clause that references the indexed columninserting data is slower when an index is present. In addition to adding a record to the table, an INSERT requires the index to be updated to reflect the position of the new data within the sort order of each indexed column.

Think of the filing cabinet in a traditional office, with customer files arranged alphabetically in the drawers. When a new document is filed, it takes a moment to locate the correct drawer and file and make sure that the item is put away in the correct place. However, when it comes to accessing documents in the future, everything relating to a particular customer is in the same place and can be found very quickly.

The alternativea large stack of paperslets you file things away much faster by just placing them on top of the pile, but looking for specific information becomes a daunting task. Just as most offices use a filing cabinet rather than an unsorted pile, most database tables can benefit from indexes when used correctly.

Don't go overboard with the number of indexes you create. The biggest performance benefits come from careful index selection based on the nature of the data stored in tables and the ways in which the application interacts with it.

If you do have too many columns indexed, it is possible that SQLite will not pick the best one for any particular query. Only one index can be used on each table at a time, so a clustered index is the only way to apply the capabilities of an index across two columns simultaneously. We will see an example of this later.

When Indexes Can Be Used

You also need to bear in mind what kinds of conditions can and cannot make use of an index. Most operators will use an index if one is available provided that the column value is not modified before it is compared.

The following examples will all make use of an index on mytable.myfield if there is one.

 SELECT * FROM mytable WHERE myfield = 'xyz'; SELECT * FROM mytable WHERE myfield > 2000; SELECT * FROM mytable WHERE myfield IN ('val1', 'val2'); SELECT * FROM mytable WHERE myfield IN (SELECT somefield FROM anothertable) 

However, the BETWEEN and LIKE operators will not use an index at any time. Such queries can usually be rewritten if using an available index would be beneficial. For instance

 SELECT * FROM mytable WHERE myfield BETWEEN 10 and 20; 

becomes

 SELECT * FROM mytable WHERE myfield >= 10 AND myfield <= 20; 

Similarly, a LIKE operator that matches the first few characters of a string can be written using two inequality operators.

 SELECT * FROM mytable WHERE myfield LIKE 'sql%'; 

becomes

 SELECT * FROM mytable WHERE myfield >= 'sql' AND myfield < 'sqm'; 

Note

A LIKE operator with characters after a wildcard symbol cannot be rewritten as a pair of inequalities. This would be like trying to find people whose surnames end with a "y" in the telephone directory.


These examples show that two conditions on the same column combined with an AND can still make use of an index. However, this is not true of the OR keywordthe IN operator should be used instead wherever possible.

 SELECT * FROM mytable WHERE myfield = 'abc' OR myfield = 'xyz'; 

becomes

 SELECT * FROM mytable WHERE myfield IN ('abc', 'xyz'); 

If some operation is performed on a column value in the condition, SQLite will not be able to use an index. For instance, the following queries will be evaluated using a full table scan, regardless of any indexes on myfield:

 SELECT * FROM mytable WHERE myfield % 2 = 1; SELECT * FROM mytable WHERE substr(myfield, 0, 1) = 'w'; SELECT * FROM mytable WHERE length(myfield) < 5; 

Tweaking the Table List

To join two or more tables together, the rows have to be combined before the results are returned, and the SQLite engine does this using nested loops. The hierarchy of the loops is determined by the order in which table names are specified in the FROM list, so that the leftmost table becomes the outer loop and the rightmost table becomes the inner loop.

Knowing a little about the tables you are performing the SELECT on can help you use this feature of SQLite to increase the speed of the query. If you use the tables that will produce the smallest number of rowsafter any WHERE clause conditions that directly apply to that table have been appliedthe outer loop will be as small as possible, and conditions evaluated on the inner loops will be performed a smaller number of times.

Benchmarking

Listing 4.1 shows a simple benchmarking shell script that can be used to execute a query repeatedly within sqlite and using the Unix time command to time execution.

Listing 4.1. benchmark.sh
 #!/bin/sh if [ $# -ne 3 ]; then   echo "Usage: $0 database-file sql-file num-loops"   exit; fi dbname=$1; sqlfile=$2; numloops=$3 tmpfile=/tmp/sqlite_benchmark.sql > $tmpfile echo "Generating temporary SQL file" for i in 'seq 1 $numloops'; do   cat $sqlfile >> $tmpfile done echo "Executing query $numloops times" time sqlite $dbname < $tmpfile > /dev/null rm -f $tmpfile 

This script is quite crude but will do enough to tell us how fast one query is in relation to another. A shell script has been used here so as not to tie it to any particular programming API, but after you have read the chapters in Part II of this book, "Using SQLite Programming Interfaces," you will be able to create a benchmarking process that suits the environment you work in.

Note

The script in Listing 4.1 uses the seq command, which is part of the GNU coreutils package, included on Linux and certain other Unix systems as standard. If your operating system does not include this command, the package can be obtained from http://ftp.gnu.org/pub/gnu/coreutils/.


The shell script version can be quite slow because it involves a lot of disk writes to generate the temporary SQL file. Using a programming API, the query could be executed repeatedly in a loop construct within the same SQLite connection.

However, to perform a loop in the shell with sqlite called multiple times to execute the same query over and over again would carry a large overhead as the program has to start up, and open and close the database file for each iteration. As we are only interested in the time taken to perform the query, the file containing multiple queries is built up before sqlite is invoked and this work is not timed.

Create a text file containing the query to be analyzed and execute the program using the following syntax. Note that because the SQL query is passed to sqlite many times from one file, the terminating semicolon must be present.

 ./benchmark.sh database-file sql-file num-loops 

The output will look something like this:

 $ ./benchmark.sh words somefile.sql 10000 Generating temporary SQL file Executing query 10000 times real    0m6.814s user    0m2.360s sys     0m1.220s 

Ideally the number of loops should be as high as you can bear to wait for the result to obtain a reliable average execution time. This will depend on the actual execution time of your query. The preceding example looped 10,000 times and took just a few seconds, but for a slow query even one loop might take longer than this.

Some Examples

The examples that follow use some relatively large sample tables to demonstrate that indexes can be used effectively. You can download a sample of the database from the Sams Publishing website at http://www.samspublishing.com/. Enter this book's ISBN (without the hyphens) in the Search box and click Search. When the book's title is displayed, click the title to go to a page where you can download the code if you want to work through these examples.

The first example has three tables, each containing 10,000 rows. This is not large compared to the sizes of databases SQLite is capable of handling, but is big enough for us to see the difference that an index can make.

The tables each contain a numeric integer between 1 and 10,000 and one word pulled randomly from a dictionary file. Table t1 uses an INTEGER PRIMARY KEY and contains each value from 1 to 10,000 in the id column. The schema for t1 is as follows:

 CREATE TABLE t1 (   num INTEGER PRIMARY KEY,   word TEXT NOT NULL ); 

Though word is not defined as UNIQUE in t1, the sample table was created in such a way that values of word are not duplicated. Therefore this table contains a list of 10,000 random English words, numbered sequentially.

 CREATE TABLE t2 (   num INTEGER NOT NULL,   word TEXT NOT NULL ); 

Table t2 contains one record for each value of word in t1; however, the num column contains a random integer between 1 and 10,000. As there is no UNIQUE constraint on num, the random way in which the data was generated has resulted in some numbers being duplicated and some being omitted from this table.

 CREATE TABLE t3 (   num INTEGER NOT NULL,   word TEXT NOT NULL ); 

The final table, t3, has a similar schema to t1 but with no INTEGER PRIMARY KEY. In fact data has been inserted so far in such a way that num is unique and every value from 1 to 10,000 is present, but to start with we do not want this column to be indexed.

Values of word were inserted randomly from the same dictionary file, which actually contains more than 15,000 words. Therefore some words from t1 appear twice or more in t3, some are omitted, and some values that do not appear in either of the other tables have found their way into t3.

This data is trivial, but it allows us to create some reasonably large joins on both num and word to see how indexes could be used. In the queries that follow, we use count(*) as the selected output because the number of rows returned will usually be large and returning the count will not significantly affect the query's execution time.

Let's start by looking at how the INTEGER PRIMARY KEY in t1 makes things faster, even though there is no visible index. Because the records in t1 are stored in the order of the num column, a comparison on this column is much faster than on t2 where the ordering of num is random.

Listing 4.2 shows query t1.sql for performing a very simple benchmark on these tables using the equals operator.

Listing 4.2. t1.sql
 SELECT count(*) FROM t1 WHERE num = 3500 AND 4000; 

The result of executing this query 1000 times is

 $ ./benchmark.sh words t1.sql 1000 Generating temporary SQL file Executing query 1000 times real    0m0.007s user    0m0.000s sys     0m0.010s 

File t2.sql performs the exact same query on table t2. Here are the results:

 $ ./benchmark.sh words t2.sql 1000 Generating temporary SQL file Executing query 1000 times real    0m11.261s user    0m9.270s sys     0m1.930s 

As you can see, the difference the INTEGER PRIMARY KEY makes is vast because SQLite knows for certain what the order of the records will be.

When table t3 was created, records were actually inserted in order of column num, but without the column being indexed, SQLite still has to perform a full table scan to be sure that it does not leave out any rows meeting the condition.

 $ ./benchmark.sh words t3.sql 1000 Generating temporary SQL file Executing query 1000 times real    0m10.898s user    0m8.880s sys     0m1.970s 

If we create an index on t3.num, the benchmark results show speeds comparable to the those of the original query using the INTEGER PRIMARY KEY:

 sqlite> CREATE INDEX t3_num_idx ON t3(num); $ ./benchmark.sh words t3.sql 1000 Generating temporary SQL file Executing query 1000 times real    0m0.029s user    0m0.030s sys     0m0.000s 

Joining any two of these tables produces a potential 100 million rows, so this is where indexes can make a real impact. Consider the query in Listing 4.3, which joins tables t1 and t2 on the num field. As t2.num is not indexed, a full table scan must be performed on t2 for each row in t1.

Listing 4.3. joint1t2.sql
 SELECT count(*) FROM t1, t2 WHERE t1.num = t2.num; 

This query is unacceptably slow, in fact so slow that it's not even worth running multiple times with the benchmark script. Just one execution took over a minute to complete.

 $ time sqlite words < joint1t2.sql 10000 real    1m3.822s user    1m1.000s sys     0m0.170s 

If we join t1 and t3 instead using the same condition, this is much faster. As each row in t1 is iterated, SQLite can find the corresponding record in t3 almost immediately using the index we just created.

 $ ./benchmark.sh words joint1t3.sql 100 Generating temporary SQL file Executing query 100 times real    0m4.721s user    0m4.400s sys     0m0.310s 

Remember that we said the order in which tables appear in the FROM list is significanthere's an example. Let's take Listing 4.3 and reverse the table order as shown in Listing 4.4. This time, each row in t2 is examined in turn and joined to t1 using the condition specified. Because t1.num is an INTEGER PRIMARY KEY, SQLite can jump straight to the row in question.

Listing 4.4. joint2t1.sql
 SELECT count(*) FROM t2, t1 WHERE t1.num = t2.num; $ ./benchmark.sh words joint2t1.sql 100 Generating temporary SQL file Executing query 100 times real    0m4.435s user    0m3.960s sys     0m0.380s 

Let's see what difference an index can make to the word column, which contains a variable-length character string, using the query shown in Listing 4.5.

Listing 4.5. t1word.sql
 SELECT count(*) FROM t1 WHERE word = 'tickles'; $ ./benchmark.sh words t1word.sql 1000 Generating temporary SQL file Executing query 1000 times real    0m8.533s user    0m6.370s sys     0m2.070s 

If we add an index on t1.word and rerun the same test, the performance is much, much better.

 sqlite> CREATE INDEX t1_word_idx ON t1(word); $ ./benchmark.sh words t1word.sql 1000 Generating temporary SQL file Executing query 1000 times real    0m0.185s user    0m0.160s sys     0m0.030s 

Our sample query just looks for a single word; in t1 and t2 the word "tickles" appears exactly once, so we could use a unique index if we want to. However, in t3 it actually appears three times, so trying to create a unique index gives a warning:

 sqlite> CREATE UNIQUE INDEX t3_word_idx on t3(word); SQL error: indexed columns are not unique 

Instead we have to use a nonunique index on t3.word.

 sqlite> CREATE INDEX t3_word_idx on t3(word); 



    SQLite
    SQLite
    ISBN: 067232685X
    EAN: 2147483647
    Year: 2004
    Pages: 118
    Authors: Chris Newman

    flylib.com © 2008-2017.
    If you may any questions please contact us: flylib@qtcs.net