Do It Once, Do It Right, and Forget It Forever

In Visual Basic 6, there are many ways to retrieve and manipulate data, and quite a few places to stash it once you have it. There are also a wide variety of application architectures possible—ranging from the wildly successful to the frankly disastrous. Exactly which architecture you choose should (and now will—let's be optimistic and pretend that managers don't exist) be dictated by what you are most familiar with. This is not what it says in the books—except this one—but it is the truth. In books, the experience of the team and the existing code base are usually ignored. However, in reality we can't choose something we don't know about, and we would be foolish to choose something for a crucial and time-sensitive area of our system that we don't understand quite well. In Visual Basic, one architecture choice can be two-tier systems using Data Access Objects—or even data-bound controls—to access thousands of rows of data at a time and/or to support hundreds of concurrent users. This is not generally recognized in the Visual Basic community as a passport to success, riches, and fame. However, if it's all you know about, and if the deadline is almost impossible anyway, and if there isn't a design specification—and if, in fact, it is a typical Visual Basic project, regardless of the fact that this is the code equivalent of taking elephants over the Alps—it is probably what you will have to do.

This chapter isn't going to tackle syntax in all of Visual Basic's data access methods. There are books that already cover this. I recommend William Vaughn's Hitchhiker's Guide to Visual Basic and SQL Server (sixth edition, Microsoft Press, 1998) for a good fast start. Instead, this chapter looks at some general principles, assesses the options, and attempts to come up with a once-only approach that can be reused (perhaps with tweaks) on many projects.

To pull off this amazing feat, we will have to make certain assumptions, and if your projects don't fit them, this approach might not, either. However, we will try to make the assumptions nice and general, and also slant them toward the more generically difficult areas of data access, where much of The Mandelbrot Set's (TMS) work tends to be. These assumptions will define the kinds of systems we are asked to build: the toughest ones, the big and scalable systems that Visual Basic developers are increasingly being asked to build.

This is mainly because these are the most difficult systems to build, and if we can build these we can build anything. Although some techniques will cover systems that do not have to scale, we are looking here to cover all eventualities, including the lie that there will "never be more than 20 concurrent users, honestly, one of the directors told me."

We shall also assume that you would rather be using objects than not, and that you think tiered systems are a practical proposition and generally the way to try and go for large or scalable systems. Finally, we will assume that even if you don't have to build an Internet-, extranet-, or intranet-accessible system yet, the day is coming when some pointy-headed, simian-browed marketing person or middle-management executive (I always think "executive" derives both from "to execute" and "needing/deserving execution") will say, "Duh, I have a good [sic] idea. We should make the system work on da Web!" When you ask this sort of person for a business justification for this high-risk strategic move, he or she will answer "Duh, silly, everybody's doing it!"

Our goal is to wrap your chosen data-access technology in a simple-to-use component with a programming model that is very straightforward for the common things, yet can handle the more complex ones when required. The idea is quite simple really—get the Database Dork to put all his knowledge into a component, and then everybody else can access the data sources they need without having to worry too much about clustered indexes.

Let's Look at Data

"No data yet," he answered. "It is a capital mistake to theorize before you have all the evidence. It biases the judgment."

"You will have your data soon," I remarked, pointing with my finger.

Sir Arthur Conan Doyle, A Study in Scarlet

We shall begin by looking at data—which we originally called uninteresting. So let's see if we can spice it up a little.

Data was once central to many people's view of computing—certainly mine. I was once indoctrinated with Jackson's Structured Design and Programming. His central premise was that computer processing was all about the gap between two different but related data structures. The design process was all about identifying the data structures and the functionality required to transform one into the other. This seemed viable to me when I was working largely in a batch-processing mainframe environment. However, when I moved to working in a client/server model, I encountered a dichotomy. Client/server wants to make data services available to all—usually through memory-loaded working copies—but the prime data, the source of these copies, has to have a high level of data integrity. There is a tension here.

I came to see that the location of data, its scope, and how it is used combine to define the kind of data it is and how we ought to work with it. For me, there are four major types of data.

Master or prime data

This is the data that is almost invariably in a shared database server. It is the most current and most highly trusted data in the system, and represents the currently committed instance of each data entity. Usually this sort of data is accessed interactively and concurrently by many data consumers, possibly including batch processes. It is important stuff, and has to be secured properly, backed up, and have a restore-from-backup process that actually works. Because it is shared, it is usually placed on a shared server at the same or a higher scope as its usage. Thus, if the system is a departmental one, its prime data will be located on a departmental server, or one that can be accessed by the whole business. It is possible to split this data across a number of servers, but we'll talk a little more about distributed data later on.

Management or business copy data

This kind of data is most often used for reference. It is usually a read-only snapshot of prime data. It is frequently frozen in time or includes some derived summary or aggregation data, so there may be redundancy here. It is usually updated by being recreated through a regular refreshment process, so there might be versions of the same data kept for a period. It is essentially static, unchanging data, so no problems of synchronization arise. This data could be centralized—there's a current vogue for data warehouses—but it could just as easily be distributed, even to the point of being in individual spreadsheets on particular users' machines.

User or operational copy data

In addition to management having copies of the prime data for particular uses, it is quite common for users to have their own copies, usually made with an eye to improving some level of operational efficiency, performance, or service. For instance, a laptop user might carry a copy of his or her own client's records while on the road, to increase response time rather than having to phone the office or wait until returning before checking something. This data is a duplicate. Whether it will need to be synchronized or simply refreshed periodically in the same way as Management Copy Data will depend on its scope and usage. If the responsibility of updating the data is shared, this increases the complexity of synchronization.

However, operational copy data could equally be static data, such as look-up data, which is rarely if ever changed, and if so might be copied to a departmental server from a central source, or even down to individual user PCs to speed the process of consulting it.

Operational copy data can be a subset of volatile prime data copied to a specific location server. Most updates might be made to this local copy, but some might be made to the prime data, in which case a process of synchronizing and reconciling the local copy and the prime data needs to be set up. This is usually done via some transaction logging mechanism, but here the issue of just how real-time the synchronization has to be raises itself.

Work-in-progress data

This is where (usually, we hope) a copy of a subset of prime data is made in order to carry out some specific task. For instance, imagine a software project manager accessing all the timesheets for a given project in order to update the project's progress, write a weekly report, or reconcile progress in time against progress measured by testing. Here a number of changes might be accumulated against a subset of the prime data, and after a particular series of steps for the task is completed, the data with changes is all committed by sending a single transactional unit back to update the prime data. Depending on its size, it can be kept in memory while it is being worked on (an option not without its challenges). However, even if this is the intention, the data can find itself spilling into temporary storage on disk if the machine is under-specified for the task—merely through the ministrations of the operating system. Alternatively it may be stored on disk anyway, in an attempt to avoid major recovery problems. This is usually the data with the shortest working life.

Tiering and Data

Let's get two-tier out of the way

Having said earlier that we will assume you want to do it tiered (I must get some T-shirts printed: Real Developers Do It In Tiers), I'll start off with two tiers. I have a good reason for this—to cover choices between DAO, ODBCDirect, Remote Data Objects (RDO), and ActiveX Data Objects (ADO), and also to look at the most successful paradigm for this kind of development. Understanding two-tier will also give us a basis for noting how familiar Visual Basic practices have to change when you tier.

Where's the data?

This might strike you as a straightforward question—it's in the database, isn't it? Well, the architecture we choose decides where the data is, partly because it will tell us what parts we will have. One way or another there will be data in all the parts (or, at least, data will be required by all the parts). So let's look at some architectures, find out where the data might be, and examine how to access and manipulate it.

Let's consider a typical client/server usage scenario and see how it suggests itself for database usage. A typical scenario, from the user interface perspective: a customer phones your company and wants to place an order. The order clerk has to first find the customer's details (let's assume this customer is already in the system) by inputting some values. The relevant customer details are displayed (perhaps a picklist of possible customers comes back, and the clerk has to do a little narrowing down to get the right customer and customer number to attach the order to). Once the customer number is available, the customer starts ordering stock items. Each item's availability has to be checked before an order line can be added to the order. Once all the order lines are in place, the clerk is supposed to quote the order number and the total, and if the customer is cool with it all, the clerk places the order.

From the server perspective, a customer picklist has to be created based on some user input. Relevant customer details have to be returned when the customer is positively identified. Stock items and their availability have to be checked for each order line. Finally the order has to be placed. This entire transaction has involved four distinct interactions with the database. The first three are trivial and straightforward in that they are read-only. Let's examine these first.

An ADO Tip

The number of rows that ADO fetches and caches is decided by the Recordset object's CacheSize property. If you are looking at rows in the cached row range, ADO gets that data from the cache. As soon as you scroll out of the range of cached rows, ADO releases the cache and fetches the next CacheSize number of rows back for you and puts them into the cache. Once you have data flowing back from a production size database and being used, you can watch how your application uses data. What you see can lead you to tune the CacheSize property, and this can reduce the number of network trips for data. Reducing network trips is almost always a cool thing to do.
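To make the tip concrete, here is a minimal sketch of tuning the cache; the query, connection string, and the value of 30 are invented for illustration, not a recommendation:

' A minimal sketch of tuning CacheSize. The query, connection string,
' and cache value of 30 are invented for illustration.
Dim rs As ADODB.Recordset
Set rs = New ADODB.Recordset

rs.CursorLocation = adUseServer   ' caching matters most with server-side cursors
rs.CacheSize = 30                 ' fetch and cache 30 rows per trip to the server
rs.Open "SELECT CustomerId, CustomerName FROM Customers", _
        "Provider=SQLOLEDB;Data Source=TMS1;" & _
        "Database=DUDES;User Id=sa;Password=;", _
        adOpenKeyset, adLockReadOnly

' Rows within the cached range are served from the cache; scrolling
' past row 30 triggers the next 30-row fetch.
Do While Not rs.EOF
    Debug.Print rs!CustomerName
    rs.MoveNext
Loop
rs.Close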

Getting Data Back

This was an unexpected piece of luck. My data were coming more quickly than I could have reasonably hoped.

Sir Arthur Conan Doyle, The Musgrave Ritual: The Memoirs of Sherlock Holmes

All we want to display in the user interface is read-only data. So we confront the cursor decision, a danger to all clean-living developers who want to keep their language clean. Some database developers suffer from a kind of Tourette's syndrome, and can only choose cursors. Most know, however, that cursors are a very expensive way to fetch data and are rarely used in production-class applications.

Though I had hoped to avoid it, I knew that I wouldn't be able to escape talking about cursors here. So let's look at selecting an appropriate data-access strategy. This is more than just picking between RDO, ADO, DAO, ODBCDirect, ODBC API, and proprietary APIs. It is even more than picking what kinds of cursors (if any) you will use. It is also about deciding when and where we will open connections to the database, how long we will hold them, when and how to use transactions and locking, and how we will get our data to all the pieces of our application in this distributed, component-based architecture we are considering.

First let's deal with the choice of data-access libraries. If you are starting a new project in Visual Basic 6, and you have no particular allegiance (i.e. you or your team are not very skilled in another technology and don't have a large investment in an existing code base), the choice is really a strategic no-brainer—choose ADO. However, if there are other factors (for example, you are starting with an MDB back-end and moving to an RDBMS when ODBCDirect suggests itself), weigh the decision up. As a little aide-memoire ("memory aid" for the uninitiated), the companion CD includes a spreadsheet that might help.

The latest version of ADO (ADO 2.0, which ships in Visual Basic 6) is fast. (Check out RDOvsADO in the sample code for this chapter.) It has low memory usage and a small footprint—although until OLE DB providers for your databases are available, you may have to use an existing ODBC driver along with Kagera [MSDASQL.DLL], the generic OLE DB provider for ODBC. Some native providers ship with Visual Basic 6 (for Jet, Oracle, and SQL Server, for example). However, if you do need to use Kagera, this ups the total footprint size (and incidentally, it doesn't support independently addressed multiple resultsets). ADO 2.0 also now supports events, so you can connect and fetch data asynchronously and react to events when operations have been completed. ADO creates free-threaded objects that are speedy, especially for Web-based applications. If you are familiar with RDO, you'll find navigating the ADO object hierarchy a little different, mainly because you don't have to. Most important ADO objects can be instantiated independently. This seems more convenient, but under the covers many of the other objects in the hierarchy need to be instantiated—albeit not by you directly—and this necessitates more parameters on the instantiation methods you do have to call. ADO allows batch updates, as does RDO. (One of the most interesting things, from a tiering perspective, is the remotable resultset, but we'll come to that later. ADO also has some particularly interesting ways of handling parameter queries.)
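As a taster, here is a minimal sketch of the event support: an asynchronous connect that raises ConnectComplete when it finishes. It belongs in a form or class module (WithEvents demands one), and the connection string is invented for illustration:

' A minimal sketch of ADO 2.0 events. The connection string is
' invented for illustration.
Private WithEvents mConn As ADODB.Connection

Private Sub Form_Load()
    Set mConn = New ADODB.Connection
    mConn.ConnectionString = "Provider=SQLOLEDB;Data Source=TMS1;" & _
                             "Database=DUDES;User Id=sa;Password=;"
    mConn.Open , , , adAsyncConnect   ' control returns to us immediately
End Sub

Private Sub mConn_ConnectComplete(ByVal pError As ADODB.Error, _
                                  adStatus As ADODB.EventStatusEnum, _
                                  ByVal pConnection As ADODB.Connection)
    If adStatus = adStatusOK Then
        ' Safe to start fetching here - asynchronously too, if we like,
        ' by passing adAsyncFetch when we open a Recordset.
    End If
End Sub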

And now for cursors. At the risk of trying to teach my intellectual betters, I have included a sidebar on cursors. Experienced database programmers will, or at least should, know about cursors. But a number of Visual Basic programmers are a little vague, and this information is basically a prerequisite for making an informed decision on a data-access strategy. If you are confident in your cursor knowledge, feel free to skip the sidebar—I always have difficulty deciding whether and when to read sidebars myself. If you are going to, now is a good time.

We're back from the valley of the cursors, and if you remember, we've dealt with getting the data we need down and into our user interface. Now, let's deal with changing and updating it.

Beefing Up Two-tier

The two-tier architecture is pretty widely used today and generally scales to somewhere under the 100-user mark. Standard two-tier suffers from notable limitations:

  • Too much data is pulled across the network and dealt with at the client.
  • "Functionally rich," fat clients require extra maintenance and limit reuse.
  • Databases only scale to a certain number of users.

Using stored procedures in the database is a means of extending the limits of a system. Typically, a two-tier system that would otherwise hit a ceiling of around 100 users can often support as much as double that number this way.

How do stored procedures do this? To start with, stored procedures can improve database scalability, since they use a database's own optimized code to efficiently process data, and may even use the database's SQL extensions to do specific tasks more efficiently. Stored procedures are also stored in an efficient manner (usually in some precompiled, ready-to-rock state) so they are quick to start as well as rapid to run. Obviously, because they are within the database itself, stored procedures avoid dragging lots of data across a network, since by their nature they must process the data at its source. Stored procedures are frequently used as a way of restricting data access, in that they can be restricted to specific user groups or even individuals. This means that the stored procedure—or its cousin the database view—is in fact an encapsulation of the data. This can help to avoid the necessity of changing the application if the database structure has to change. There's also an element of reuse with stored procedures: different interfaces and applications can reuse the same stored procedures.
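To make that concrete, here is a minimal sketch of calling a stored procedure through an ADO Command object; the procedure name, its parameter, and the connection string are invented for illustration:

' A minimal sketch of calling a stored procedure via ADO. The procedure
' name, its parameter, and the connection string are invented for
' illustration.
Dim conn As ADODB.Connection
Dim cmd As ADODB.Command
Dim rs As ADODB.Recordset

Set conn = New ADODB.Connection
conn.Open "Provider=SQLOLEDB;Data Source=TMS1;" & _
          "Database=DUDES;User Id=sa;Password=;"

Set cmd = New ADODB.Command
Set cmd.ActiveConnection = conn
cmd.CommandText = "usp_GetCustomerOrders"
cmd.CommandType = adCmdStoredProc
cmd.Parameters.Append cmd.CreateParameter("CustomerId", adInteger, adParamInput, , 42)

Set rs = cmd.Execute   ' the joins and filtering all happen at the server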

But the land of stored procedures isn't all milk and honey. SQL is not as powerful as more general programming languages such as Visual Basic. Long-running stored procedures can tie up database connections. Stored procedures, again due to their nature, must always run on the same machine as the data, which limits and essentially eradicates hardware scaling of an application.

Cursors—Just Say No: Or, How I Learned to Stop Worrying about Locking and Love Read-Only Data

The big three in Visual Basic data access libraries—DAO/ODBCDirect, RDO, and ADO—can all implement cursors. The kind of cursor you choose has a vast impact on the performance, concurrency, and thus scalability of your system. Getting it wrong can really hurt. So it helps to know your cursor options carnally.

…in the evening we found a reply waiting for us at our hotel. Holmes tore it open, and then with a bitter curse hurled it into the grate.

Sir Arthur Conan Doyle, The Final Problem: The Memoirs of Sherlock Holmes

Choice of cursor option primarily affects where a resultset will be built, how it will travel to its consumer, how an application can move through the data, and whether and when data in the cursor is updated. For a long time there have been two ways to look at cursors. The first is the lazy, sloppy, thoughtless way (at TMS, the shorthand for this is "evil cursors"). Rather than thinking carefully about what cursor functionality is required in each given data access situation, and then choosing a cursor that provides a close fit and as minimal and lightweight a set of functionality as can be lived with, many developers choose the fullest functionality and thus the heaviest and most expensive cursor option available. The principle here seems to be, "If it can't be done with the biggest cursor, it can't be done in Visual Basic." In addition, some developers feel that a generic cursor solution is best, rather than taking the trouble to understand the benefits and drawbacks of specific cursor options. In reality though, time spent choosing the most appropriate cursor options is time well spent. If we don't spend it here, much more time will be squandered by our frustrated users as they wait for data to travel across the network or for locks to be released.

The following is a list of options you must decide on when selecting a cursor.

  • Location This is where the cursor will be physically built—there is also a link with membership, in that the combination of location and membership can control where the data and keys are and how and when they move from client to server. Location has the following options:

      Client The cursor is built on the client. If it is a static cursor, all the data travels across the network at one time, down to the client. This raises questions of capacity—does the client have enough RAM and disk space for this? If the resultset is large, and especially if few of the records in it will become current records, this might be a bad choice. Don't get fooled into thinking that the client is the sink client (where the user interface is) if you open a client-side cursor on a middle-tier machine. When you finally have the resultset on the client, performance can seem fast. As a general rule, client-side cursors scale better, since you are spreading the load across many workstations.

      Server Some databases support building the cursor on the database server machine in a temporary area. SQL Server does this and caches the recordset in the TempDB. Then when the client wants to make a given record the current record, the data for that record travels from the server to the client. Thus all that is on the client at any given time is the current record's data. This is mainly useful for lightweight clients and large recordsets, especially if you are only going to directly access and manipulate a few of the rows in a large recordset. If you use the server option wisely, it can greatly reduce network traffic. So this option can be the best choice when there isn't much spare bandwidth on a network, or when client-side cursors would generate a lot of traffic. The cursor has to be fully populated before control goes back to the client code, so it can sometimes feel a bit slower, especially if you like to pull the trick of trying to show the first row in the user interface as soon as it is available, and then putting the others in while the user is ogling it. Also, try not to have too many of these, especially if they are big and open at the same time—the TempDB is finite, too. If you use these you can't have disconnected recordsets (see "Returning Changes from a Disconnected Recordset" later in this chapter). Opening server-side cursors in transactions can have locking implications such as blocking other users. Note also that server-side cursors don't support executing queries that return more than one recordset. Alternatives to server-side cursors are using a client-side cursor, or preferably a disconnected recordset.

  • Membership/Cursortype This refers to which rows are in the cursor and which aren't, and when the database engine picks them. (Initially the query processor decides, based on the WHERE clause, what goes into the resultset.) Also included here is what is in the cursor, in terms of data or just keys (IDs of records). This has an impact on what travels across the network when. This has the following options:

      Static Static cursors usually retrieve a copy of all records in the cursor's membership. Membership freezes when the last row is found (usually when code or database engine moves the current row pointer to the last row), so data changes made by other users don't join the membership. Often some sort of locks (usually share locks) are kept in place until this is decided. Usually this means that all the data travels from the database engine across the network to the client and has to be stored there. Static cursors are usually fully scrollable.

      Keyset The cursor is built with a key for each record in the membership. Although membership is usually static in that new rows won't be added to the membership, in this case the keys (the IDs) of the records are what initially travels to the client. When a given record becomes the current record, a request goes back to the database for the data for that record, using its key. This means that at any time all the keys plus the data for the current record are likely to be on the client. Because keys, not full records, are being stored, the overhead is less than for static or dynamic cursors. This can help when not all records will be touched (become current record) or where the client has limited data space.

      Dynamic Some cursors never close membership—instead they keep asking the cursor driver to requery the database to make sure that rows added, updated, and deleted are taken into account in the membership. This can be very costly to performance, chewing up lots of memory and network resources. Dynamic cursors can change their membership to see changes made by other users. They are fully scrollable and normally updateable—the gold-plated cursor.

  • Scrolling This is the capability provided by the combination of database engine and cursor driver to let you move from row to row. The following are the scrolling options:

      Scrollable This allows capabilities such as bookmarks so that you can mark a record as you scroll, move the current record and then return to it later, or jump up and down a recordset using MovePrevious and MoveFirst type functionality. This is expensive functionality. Usually most applications read through a resultset once to put it in the interface, so they don't need this.

      Forward-only This is not really a cursor, but it's usually the first choice, especially when read-only. This provides the capability to go from start to finish sequentially (FIFO—first in first out) through the resultset, making each item in it the current record in turn. This is adequate for most applications bringing back large numbers of rows at a time. (By "large" here we often mean more than one!) The forward-only cursor is generally the fastest and the lightest on resources.

  • Updateability This is how your database and cursor driver let you go about changing membership and data values. Updateability has the following options:

      Updateable You can maintain data (add new rows, and delete or update existing ones) using the cursor's functionality—AddNew, Delete, and Edit/Update methods on a resultset object are how this normally looks. If you don't use this option (which again is very expensive functionality), you can use action queries. This option is expensive because each row has to be individually addressable, which normally requires a primary key on the table. Use Read-only if you aren't going to update, since it is more efficient.

      Read-only This is exactly what it says on the side of the can. This option is cheap on resources and is usually an excellent choice, especially when combined with Forward-only scrolling.

So what do RDO and ADO give you in cursors? Rather than having you wade through another huge table, check out the Visio diagram included on the companion CD. The code for checking this is in the sample RDOvsADO.
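To give a feel for how the options above surface in code, here is a minimal sketch of the same query opened two ways in ADO: once as a cheap "firehose," once as a heavyweight scrollable, updateable recordset. The query and connection string are invented for illustration.

' A minimal sketch of how Location, Membership, Scrolling, and
' Updateability map onto ADO's CursorLocation, CursorType, and
' LockType. The query and connection string are invented for
' illustration.
Dim conn As ADODB.Connection
Dim rs As ADODB.Recordset

Set conn = New ADODB.Connection
conn.Open "Provider=SQLOLEDB;Data Source=TMS1;" & _
          "Database=DUDES;User Id=sa;Password=;"

' Cheap: server-side, forward-only, read-only - fine for filling a list.
Set rs = New ADODB.Recordset
rs.CursorLocation = adUseServer
rs.Open "SELECT CustomerId, CustomerName FROM Customers", conn, _
        adOpenForwardOnly, adLockReadOnly
' ... read it once, top to bottom ...
rs.Close

' Expensive: client-side, static, optimistically locked - scrollable and
' updateable, but every row travels to the client up front.
rs.CursorLocation = adUseClient
rs.Open "SELECT CustomerId, CustomerName FROM Customers", conn, _
        adOpenStatic, adLockOptimistic
rs.Close
conn.Close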

So when picking a cursor, you should consider what functionality you need, and then find the cheapest cursor option that will provide it. Once you've made this decision a few times you'll probably have been through all the normal scenarios—such as these:

Scenario: Look up data
Cursor choice: Read-only, Forward-only, Static
Other information: On the server

Scenario: Provide a picklist to select from
Cursor choice: Read-only, Forward-only, Static
Other information: On the client if the number of list options is small; on the server if you have the option and have to provide large numbers. We always try to restrict this kind of query to a maximum size of 50-100 rows (who is going to scroll more?) and show the first 50-100 if more are found, but prompt the user to qualify the query more.

Scenario: Bring back details of one item (chosen from a picklist) into a Detail/Edit screen
Cursor choice: Read-only, Forward-only, Static
Other information: On the client—we'll do the updates/deletes using an action query if we can (not methods on a recordset, but we'll talk more about this later). If you need scrolling, use a client-side keyset cursor.

A useful concept here is that of the current row: essentially, with most resultsets and cursors, only one row of data is "exposed" for dealing with at a time. Scrollable cursors can move the current row in a number of ways (go to the first, go to the last, go to the next, go to the previous, go to a specific one, or go to one in an absolute or relative position).

You to your beauteous blessings add a curse,

Being fond on praise, which makes your praises worse.

William Shakespeare, Sonnet 84

Cursors are rarely the best way to access your data. But sometimes you need to be able to do the following:

  • Scroll forward and backward (browse) through a limited resultset
  • Move to a specific row based on a saved value (a bookmark)
  • Move to the "nth" row of a resultset in absolute or relative terms
  • Update limited resultsets created against base tables using the RemoteData control

If you don't need a cursor, you will usually find your application is faster and more efficient. So, how to avoid them? With RDO, if you set the rdoDefaultCursorDriver property to rdUseNone, all resultsets created by your application will be created as if you used the OpenResultset method with the rdOpenForwardOnly and rdConcurReadOnly options set, and with the RowsetSize property set to 1. This is often the most efficient way to pass data from the remote server to your application. ADO doesn't have such an equivalent setting (well actually, that is a lie—it has an obsolete setting, adUseNone, which exists only for backward compatibility)—ADO's default cursor location setting is adUseServer.
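In RDO that looks something like the following minimal sketch (the connect string and query are invented for illustration); the nearest ADO equivalent is simply a server-side, forward-only, read-only Recordset with CacheSize left at 1.

' A minimal sketch of the RDO "no cursor" approach. The connect string
' and query are invented for illustration.
Dim cn As rdoConnection
Dim rs As rdoResultset

rdoEngine.rdoDefaultCursorDriver = rdUseNone

Set cn = rdoEngine.rdoEnvironments(0).OpenConnection("", rdDriverNoPrompt, False, _
        "driver={SQL Server};server=TMS1;database=DUDES;uid=sa;pwd=")

' With rdUseNone in force this behaves as forward-only, read-only,
' RowsetSize = 1 - the cheapest way to stream rows to the client.
Set rs = cn.OpenResultset("SELECT CustomerId, CustomerName FROM Customers")

Do While Not rs.EOF
    Debug.Print rs!CustomerName
    rs.MoveNext
Loop
rs.Close
cn.Close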

Although it is possible to create low-impact resultsets that are still updateable through use of the Edit/Update methods, this is usually impractical because the base tables are not directly updateable, so creating an updateable cursor is not possible. This is particularly so in stateless tiered designs.

Whenever you create a resultset with a stored procedure, the resultset is not updateable—at least not using the Edit/Update methods. In these cases, you can use the WillUpdateRows event to execute an action query that performs the actual update operation(s).

However you create a cursor, you can usually update the data using one of the following techniques—even if the cursor is not updateable:

  • Executing a stored procedure that updates a selected row based on a code-provided key.
  • Executing an action query that changes specifically addressed rows. In this case your code creates a suitable WHERE clause used in the query.
  • Using the WillUpdateRows event to trap update operations and substitute appropriate stored procedure calls to perform the actual changes.

Try not to update directly through cursors—it's almost always more efficient to use a stored procedure or an action query. If you are using ADO, use the adExecuteNoRecords execution option—this stops ADO setting any cursor properties.

Dim conn As New Connection
Dim sOpen As String, sExecute As String

sOpen = "Provider=SQLOLEDB;Data Source=TMS1;" & _
        "Database=DUDES;User Id=sa;Password=;"
conn.Open sOpen

sExecute = "Insert into DudeMain values(5, 'Chris'," & _
           " 'Star Programmer')"
conn.Execute sExecute, , adExecuteNoRecords
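The same approach covers updates addressed by key, the action-query technique in the list above. Here is a minimal sketch using a parameterized Command; the column names (Id, JobTitle) and values are made up for illustration, and conn is the Connection opened in the snippet above.

' A minimal sketch of an action query that updates a specifically
' addressed row. The column names and values are invented for
' illustration; conn is the Connection opened above.
Dim cmd As ADODB.Command
Set cmd = New ADODB.Command

Set cmd.ActiveConnection = conn
cmd.CommandText = "UPDATE DudeMain SET JobTitle = ? WHERE Id = ?"
cmd.CommandType = adCmdText
cmd.Parameters.Append cmd.CreateParameter("JobTitle", adVarChar, adParamInput, 50, "Team Lead")
cmd.Parameters.Append cmd.CreateParameter("Id", adInteger, adParamInput, , 5)

cmd.Execute , , adExecuteNoRecords   ' no cursor, no recordset, no fuss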

It is even harder to justify using a query that creates an unrestrained, off-the-leash kind of cursor against one or more base tables (e.g. "SELECT * FROM Table"). Not only is this not a permissible option in protected systems, it can cause serious concurrency problems as you attempt to scale your application to more than a few users. Whenever you create a cursor, be sure to limit the scope to the fewest number of rows possible. In interactive systems (where there is a human operator), fetching more than a few hundred rows is often counterproductive and leads to increasingly complex concurrency problems.


