11.4. Working with Large Result Sets
If you work with databases, you will end up working with large result sets, and you will need to think about performance. How large is large? It depends. Some applications need to access the DB frequently and touch every row or perform complex queries while inserting and deleting rows, so a few thousand rows can be considered a large dataset. Other applications access only a small part of the data in well-known patterns, so the DB can be optimized and even millions of rows will not present a problem.
There are several common performance gotchas:
SQLObject is an abstraction layer that normally hides the gory details behind a magical thick veil. However, if you are responsible for the performance of your application and its efficient interaction with the DB, you must understand what's going on under the hood and be able to optimize when something is too slow. The following subsections discuss SQLObject-specific pitfalls, how they can cause one or more of the aforementioned gotchas, and some best practices.
11.4.1. Don't list() a Large Result Set. Use Slices.
Listing a large result set will bring it all to memory (or virtual memory). This is often not what you want: maybe you don't have enough memory, it takes longer to respond, and finally by the time you actually process some Nth row, it might already be stale. A much better approach is to use slices and process large result sets in manageable chunks. The following code demonstrates this:
class Bigwell(SQLObject): robot = StringCol() Bigwell.dropTable(ifExists=True) Bigwell.createTable() for i in range (1, 101): Bigwell(robot='Robot #%d' % i) # Getting the lazy SearchResults iterator (no DB access yet) sr = Bigwell.select() # List()ing the entire result set robots = list(sr) for r in list(sr): print r # Processing in chunks slice_size = 10 for i in range(0,sr.count() / slice_size): slice = sr[i * slice_size: (i+1) * slice_size] for r in slice: print r
Remember the Robots movie? The preceding code creates 100 Bigwell robot instances. Then it retrieves a lazy SearchResults iterator that doesn't access the DB, and finally it prints all the robots in two ways. The first one is simple: just looping over list(sr) and printing every robot in sight. Here is the generated SQL:
SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1
The result is that I access the DB once and get 100 robots in memory. If I insert another billion and rerun the code, I'll have all of them in memory (if I'm lucky to have that much memory). So, the code is simple but not very scalable.
The second approach is to slice the data and process it one slice at a time. The code is much more complicated, involves a nested loop, and includes some nasty index manipulation. Here's the generated SQL for the first and fifth slices:
SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 10 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 10 OFFSET 50
The result is that I have ten robots in memory (providing Python's garbage collector does its job) at any given moment. I can configure this number by changing the slice size, and it is independent of the actual number of robots in the DB. The DB is accessed again for every slice.
11.4.2. Don't Objectify a Large Result Set. Go with Straight SQL.
There's another problem that's specific to ORMs such as SQLObject when it comes to large result sets. Every row in the DB incurs the overhead (time and memory) of instantiating an object. In Python, instantiating a complicated object of an SQLObject-derived model class is especially painful. You pay this price even with slicing. If you intend to process huge amounts in a read-only manner, it might be better to get down and dirty and use the connection's queryAll() method and get a plain list of tuples. Another benefit is that you can retrieve only a subset of the columns. Of course, you will have to take care of the slicing yourself now. Finally, SQLObject caches every object that comes into existence (for example, when you insert a new object into the DB). Even if you don't refer to this object in your code, it will not be garbage collected because it's referenced in the internal cache. The bottom line is this: Objects don't go away unless you clean up the cache explicitly. You should be aware of this fact, especially if you have long-running processes. Here's some sample code that retrieves only the robot name from each row with proper slicing:
c = Bigwell._connection chunk_size = 10 for i in range(0, Bigwell.select().count() / chunk_size): query = str(Select([Bigwell.q.robot])) query += ' WHERE 1=1 LIMIT %d' % chunk_size if i > 0: query += ' OFFSET ' + str(i*chunk_size) chunk = c.queryAll(query) for r in chunk: print r
I resorted to manual string concatenation to properly slice the result set. Note the ugly condition where the OFFSET is concatenated only if i > 0. The reason is that SQLite doesn't allow OFFSET 0. However, there is no overhead of getting fields we don't need from the DB (think tables that contain big blobs), and there is no instantiation of object per row. There's just a plain list of tuples that contain the robot value of each row.
Here is the SQL query and its results for the second iteration of the loop:
SELECT bigwell.robot FROM bigwell WHERE 1=1 LIMIT 10 OFFSET 10 [('Robot #11',), ('Robot #12',), ('Robot #13',), ('Robot #14',), ('Robot #15',), ('Robot #16',), ('Robot #17',), ('Robot #18',), ('Robot #19',), ('Robot #20',)]
11.4.3. Apply the Same Method to Bulk Inserts, Updates, or Deletes
If you have massive bulk inserts, updates, or deletes, you can take the raw SQL route, too. Remember that queryAll() executes any SQL statement. There is no point in instantiating a bazillion objects just to capitalize the last name when you could do the same thing with a single update statement.
11.4.4. Don't Go to the DB Multiple Times for the Same Objects
Another common performance problem you can run into when using SQLObject is redundant access to the DB. SQLObject caches object state by default. That means that if you access multiple attributes of the same object, SQLObject will not go to the DB for every access. However, if you access a particular object or slice of a SearchResults object, SQLObject will execute a query for each access. The following code sample demonstrates this point:
robots = Bigwell.select() # Ten (5 x 2)queries of 1 row for i in range (0,5): id = robots[i].id robot = robots[i].robot print id, robot # One query of 5 rows for r in robots[:5]: id = r.id robot = r.robot print id, robot
Both code snippets print the id and robot columns of the first five rows in the Bigwell table. The first snippet accesses the robot's SearchResult object ten times (five iterations, two accesses per iteration) resulting in ten queries of one row. Each of the five rows is returned twice. Here is the SQL:
SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 2 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 2 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 3 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 3 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 4 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 4 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 5 SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 1 OFFSET 5
The second code snippet directly slices the SearchResult object and accesses in each iteration the r object. This is the right way to do it. The five rows are returned in a single query, and this query is not repeated. Here is the single query generated by this code:
SELECT bigwell.id, bigwell.robot FROM bigwell WHERE 1 = 1 LIMIT 5