Choosing the Right Database Management System


I expect you're reading this book because you or your customers have already planned to use SQL Server or have an existing system that uses SQL Server and you need to be able to access it. I also expect that some of my readers are migrating from Oracle, DB2, JET, or other databases and need some guidance on how best to transmogrify their applications and DBMSs to SQL Server. I am well aware that SQL Server might not be the best choice for every application. However, since it comes in so many "sizes," it certainly applies to the vast majority of application architectures in use today. There are many factors to consider when choosing a database management system, as I discuss here.

What Does SQL Server Cost?

Except in some back-street shops in Hong Kong, Microsoft SQL Server Workgroup Edition or better is not cheap. The licensing fees are computed either per processor or per server, where each individual user or device also requires a Client Access License (CAL). See Microsoft for details, but you'll find that the cheapest (nonfree) version (SQL Server Workgroup Edition) with five CALs runs about $739. That makes it pretty expensive to bundle SQL Server Workgroup Edition with your small office accounting application unless you plan to charge quite a bit for it or expect the customer to have the right version already installed. You also need to consider licensing costs for the operating system that hosts SQL Server. The Workgroup Edition can be hosted on Windows XP (SP2) or later, Windows Small Business Server 2003 (SP1) or later, and Windows 2000 (SP4) or later. MSDE and SQL Server Express editions can run on the same mix of OS versions, but that still leaves out customers running Windows 95, 98, or ME, which are no longer supported. Other costs associated with SQL Server include deployment, setup, data loading, training, and administration, which vary quite a bit from implementation to implementation.

Thankfully, it's also possible to build small systems around the "free" MSDE (SQL Server 2000) or SQL Server Express editions. Sure, you'll want to develop against the Developer ($50) or Workgroup editions but deploy with the free MSDE or SQL Server Express editions. These editions are performance- and scalability-limited, which makes them inappropriate for systems that call for higher scalability. However, they're ideal for single-user applications that have to deploy with a DBMS, smaller office applications, low-volume websites, or even multi-user systems with only a handful of users. I have heard from a number of (very) large companies that have implemented many tens of thousands of MSDE-based systems in their organizations as point-of-sale systems, small accounting and registration systems, and other fairly sophisticated implementations. I further discuss these free editions and the special restrictions they impose in Chapter 2, "How Does SQL Server Work?"

What About SQL Server Security?

Any system you build needs to address security issues from the very earliest stages of the design, before you choose a strategy or DBMS engine that does not have sufficient power to protect your customer's data. Security comes in many forms, degrees, and flavors. Of course, a secure system prevents data loss and prevents unauthorized data access, and, ideally, a secure system can be physically stolen without the data being compromised. While you might not be able to afford the amount of extra overhead needed to implement a totally secure system, you're obligated to provide at least the basics. Fortunately, a healthy percentage of the security in a SQL Server database management system is implemented automatically by SQL Server, unless you disable it. SQL Server has a wealth of security features that make sure the data is correctly and safely stored. JET and many other "home" database engines have far fewer of these features, and those features that are supported are often implemented at a very shallow depth. Recommended security functionality includes (but is not limited to):

  • Integration with Windows domain authentication or SQL Server login/password credential validation. This permits you to define specific users or roles that can be granted or denied specific rights to the database(s), tables, and columns, as well as procedures and views.

  • Encrypted (SSL) data transmission over the channel to and from the server and the ability to encrypt any or all data stored in the system using custom stored procedures (which can be written in T-SQL or custom CLR code). In SQL Server, this is implemented with Transport Layer Security (TLS).

  • Proprietary datatypes implemented in CLR code. These custom datatypes can prevent "bad," malformed, or out-of-bounds data from being stored in the database in the first place. These checks can also be implemented with rules and custom constraints.

  • Data transaction logging ensures that data that's supposed to be saved to the database is actually committed to the disk to maintain data and referential integrity, even if the server goes down before the data is physically written to the database. This means the server can automatically "repair" the DBMS by rolling back uncommitted transactions and rolling forward any committed transactions that were not (yet) written to the server. It also means that transactions can include two or more servers, with the DBMS engine providing "two-phase" commit as required.

  • More sophisticated systems support log shipping and data mirroring to move critical data off the server, so it can be more safely protected and more quickly restored and brought back online when things go awry.

  • Data validation subsystems that permit you (as a developer) to specify defaults and rules for the data so that regardless of the source of the data, it meets specific criteria such as range, datatype, and value. These can be implemented in SQL Server by constraints, rules, triggers, or other server-side executables, even those written in a CLR language.

  • Stored procedures (triggers) that can be executed each time a row is inserted, changed, or deleted, to ensure that the data change does not violate business rules or referential constraints that you define.

  • Referential integrity and index subsystems that ensure that duplicate data rows are not inserted by accident and that parent-child relationships are maintained to further protect referential integrity and to improve query performance.

As you compare your current DBMS with SQL Server or as you try to decide whether or not you need these features, consider that these same features are available in the "free" MSDE and SQL Express versions of SQL Server. Without them, your data (and your job) is far from secure.
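To make the data-validation point above concrete, here is a minimal C# sketch (using ADO.NET's SqlClient) showing how a CHECK or FOREIGN KEY constraint defined on the server rejects bad data before it ever lands in a table. The connection string, table, and column names are hypothetical placeholders, not part of any sample database discussed in this book.

    using System;
    using System.Data;
    using System.Data.SqlClient;

    class ConstraintDemo
    {
        static void Main()
        {
            // Hypothetical connection string and schema; adjust to suit your own server.
            string connString = @"Data Source=.\SQLEXPRESS;Initial Catalog=MyShop;Integrated Security=SSPI";

            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(
                "INSERT INTO dbo.OrderItems (OrderID, Quantity) VALUES (@OrderID, @Quantity)", cn))
            {
                cmd.Parameters.AddWithValue("@OrderID", 42);
                cmd.Parameters.AddWithValue("@Quantity", -5);   // assume a CHECK (Quantity > 0) on the table
                cn.Open();
                try
                {
                    cmd.ExecuteNonQuery();
                }
                catch (SqlException ex)
                {
                    // The server enforces the rule no matter which client sent the data.
                    Console.WriteLine("Rejected by the server: " + ex.Message);
                }
            }
        }
    }

The point is that the rule lives on the server, so every client, report writer, and ad hoc tool gets the same protection for free.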

IMHO

ADO (or any data access query interface) is not usually the source of performance problems, but it can be, especially when compared to low-level protocols.


What About Performance?

I constantly hear about "performance" and making applications run faster. Everyone seems concerned about the performance differences among ADO, ADO.NET, DAO, and LSMFT[8]. Naturally, when choosing one data access interface or DBMS engine over another, developers want to be sure that the application they're designing is more responsive, supports more users, and is more competitive than the application they are replacing or competing with. Before getting into ADO.NET's performance, I want to touch on another point. ADO.NET (or any data access query interface) is not usually the source of application performance problems; frankly, it's rarely the problem. Far more frequently, I've found that expensive (a.k.a. "dumb") queries, improperly configured tables, incorrect indexes, and other factors like bone-headed designs and simple ignorance play a far larger role in overall application performance than improperly coded ADO.NET routines. The problem is not how quickly your program (ADO.NET) asks the question, but how long it takes the network and SQL Server to return the answer. Sure, ADO.NET can be slow if you return too many rows for it to handle, but whose fault is that? Remember, it's designed to be a query interface, not a bulk operations interface. So can you blame poor performance on the database engine? Those who do are on pretty shaky ground. SQL Server as implemented today wins TCO/performance comparisons in some of the most complex and demanding DBMS installations in the world, especially when the price of the hardware is considered as a factor. Sure, Microsoft sometimes uses a bit of smoke and mirrors when building marketing demos, but I've interviewed enough developers (and their managers) to know that SQL Server more than meets their needs as a deadly serious high-performance platform.

[8] LSMFT: One of the first acronyms. It means "Lucky Strike Means Fine Tobacco."

When you start looking for a performance bottleneck, look first to your database design and configuration and next to the queries you're submitting for execution. For example, INSERT performance can be hampered by too many indexes, while SELECT performance can be hampered by too few. Overall performance can also be crippled by sheer volume: too many rows being returned, sorted, filtered, related, or updated at once. By asking the server to return an entire table, state, region, or country, you can transform a relational database into a file server and your client application into a leopard slug[9]. I'll talk about appropriate high-performance query syntax in Chapter 2 and in subsequent chapters throughout the book.

[9] www.fcps.k12.va.us/StratfordLandingES/Ecology/mpages/leopard_slug.htm
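To illustrate what a "focused" query looks like in practice, here's a hedged sketch: instead of dragging an entire Customers table to the client, it asks SQL Server for just the rows and columns the user needs via a parameterized WHERE clause. The table, column, and connection-string names are placeholders of my own.

    using System; using System.Data.SqlClient;

    static class FocusedQuery
    {
        public static void ListRegionalCustomers(string connString)
        {
            // Don't do this: "SELECT * FROM dbo.Customers" (it turns the server into a file server).
            // Do something like this instead:
            const string sql =
                "SELECT CustomerID, CompanyName, City " +
                "FROM dbo.Customers WHERE Region = @Region AND Active = 1";

            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(sql, cn))
            {
                cmd.Parameters.AddWithValue("@Region", "WA");   // filter on the server, not the client
                cn.Open();
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        Console.WriteLine(rdr.GetString(1));    // CompanyName
                    }
                }
            }
        }
    }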

IMHO

It's often not how quickly you ask the question, but how long it takes the DBMS engine to return the answer.


Performance Metrics

Benchmarking performance is one of the most complex and exacting data processing sciences, and one of the most misunderstood and misapplied. In order to make accurate (and fair) comparisons, you must set up a level playing field (if one exists) so that the hardware platforms don't favor one system over another. But what if they do? Perhaps the fact that system X runs best on platform Y but other systems do not should play a role in the evaluation. For example, if you discover that SQL Server runs better on hyperthreaded or dual-core processors, but the alternative does not and the target platforms are HT systems, then SQL Server would make more sense in that situation. I recommend using a typical production client or server system as a testbed so your tests can better reflect reality.

Without boring you with a chapter on how to set up and interpret the results of a benchmark, consider that one mistake many folks make is testing with insufficient data, using the wrong kind of data, or using atypical queries. The Pubs, Northwind, and AdventureWorks databases are not nearly large enough to really judge anything, unless you're planning to put them and the data they contain into production as is. When testing performance, it's hard to beat dealing with real data, not made-up data rows that don't reflect the actual variety and distribution of your customer's data. To give a simple example, when SQL Server executes a query against a table, it chooses the most appropriate index. Suppose your sample Customer table consists of three million copies of the same "Mrs. Smith" row, indexed on last name and date of birth, and only one of those rows differs in a non-indexed column. If your test query's WHERE clause matches "Mrs. Smith" and that non-indexed column, the best performance you'll ever see reflects a table scan, no matter how hard you try. If you compare that result against real data that has only 300 "Mrs. Smith" rows interspersed among 3 million others, you'll see an efficient index scan over 300 rows rather than 3 million. The point here is that when weighing different performance approaches, be careful that you are weighing the same thing.
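If you do want to time a candidate query, a simple harness like the following (a sketch, not a benchmarking suite) at least ensures you're measuring the same query, with the same parameters, against each test database. The query text, parameter, and connection strings are hypothetical.

    using System; using System.Data.SqlClient; using System.Diagnostics;

    static class QueryTimer
    {
        // Run the identical parameterized query against, say, the toy database and the
        // production-sized one, and compare elapsed times for the same work.
        public static long TimeQuery(string connString, string sql, string lastName)
        {
            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(sql, cn))
            {
                cmd.Parameters.AddWithValue("@LastName", lastName);   // assumes the SQL uses @LastName
                cn.Open();

                Stopwatch watch = Stopwatch.StartNew();
                int rows = 0;
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read()) { rows++; }   // fully populate the rowset before stopping the clock
                }
                watch.Stop();

                Console.WriteLine("{0} rows in {1} ms", rows, watch.ElapsedMilliseconds);
                return watch.ElapsedMilliseconds;
            }
        }
    }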

IMHO

Just because a duck floats, it does not make a woman a witch because she doesn't.


Comparing ADO and SQL Server vs. DAO and JET Performance

Consider that ADO is and always has been a query interface. Until Version 2.0, ADO.NET did not have any way to move data in bulk; at least, not efficiently. Unfortunately, developers tend to use bulk data transfers to measure ADO's performance. For example, they compare how fast ADO can retrieve 10,000 or a million rows against some other DBMS like Paradox, Oracle, or JET. Using this technique to benchmark is like comparing the speed of a bicycle against a Porsche 911 in Redmond rush-hour traffic (with the cops around). ADO, ADO.NET, and SQL Server have been compared countless times with DAO and JET, and JET often wins. When both DAO and ADO.NET are used to access JET, DAO often wins. Despite this apples-and-can-opener comparison, these tests sometimes come out in JET's favor because DAO is a "native" interface to JET. In contrast, ADO.NET uses a one-size-fits-all OLE DB interface designed to access any number of different types and classes of data sources. All too often, developers migrating from DAO (and JET) bemoan ADO.NET's (and, thus, SQL Server's) performance. Sure, DAO and JET are often faster than ADO.NET and SQL Server for some operations (especially bulk data transfers or simple index-seek operations), just as a flat file is far faster to read than setting up and executing a JET query and fetching the data. ADO.NET (and especially ADOc, the COM-based ADO "classic") moves a lot more metadata over the wire than the lower-level interfaces do; they assume the application knows the schema and structure of the data being sent and received. Yes, ADO.NET is better at this, as it separates the metadata from the "real" data. In some cases, this metadata can make your application safer and smarter, and make it easier to update the data when the time comes. But most of the time, it just gets in the way. You can minimize this overhead by choosing when to use the DataAdapter or TableAdapter Fill methods and when to simply use a DataReader to execute queries, as the sketch below illustrates. But I'm getting ahead of myself. All through the book, I'll show the cost of each data access technique and how to best optimize your network, RAM, and CPU resources on each end of the wire.
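Here's a rough sketch of that choice: DataAdapter.Fill builds a fully described, updatable DataTable (metadata and all), while a DataReader simply streams rows forward-only. The table, column, and connection names are placeholders; I dig into both approaches in later chapters.

    using System; using System.Data; using System.Data.SqlClient;

    static class FillVersusReader
    {
        // Fill: returns a DataTable complete with schema metadata. Handy when you
        // plan to bind, scroll, or update the rows later.
        public static DataTable GetCustomersTable(string connString)
        {
            DataTable customers = new DataTable();
            using (SqlDataAdapter da = new SqlDataAdapter(
                "SELECT CustomerID, CompanyName FROM dbo.Customers WHERE Region = @Region", connString))
            {
                da.SelectCommand.Parameters.AddWithValue("@Region", "WA");
                da.Fill(customers);   // opens and closes the connection for you
            }
            return customers;
        }

        // DataReader: a forward-only, read-only stream with far less overhead. Ideal
        // when you just need to walk the rows once.
        public static void ListCustomers(string connString)
        {
            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(
                "SELECT CustomerID, CompanyName FROM dbo.Customers WHERE Region = @Region", cn))
            {
                cmd.Parameters.AddWithValue("@Region", "WA");
                cn.Open();
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        Console.WriteLine(rdr.GetString(1));
                    }
                }
            }
        }
    }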

What's Fast Enough?

As a developer, you should be concerned with achieving "acceptable" performance for the types of queries your application must execute and the ability to scale those operations to support the load demanded by the expected users per second. Consider that a server architecture (like SQL Server) actually gets faster as more users (doing roughly the same things) are added (up to a point), not slower (like JET). Why? It's because as users perform roughly the same operations on roughly the same data, more relevant data and procedures are fetched from RAM caches and do not have to be fetched from disk. In (stark) contrast, JET and other similar ISAM database management systems cannot "share" or reuse data pages fetched by other users, as they are distributed client-resident engines and do all of their physical I/O over the Net and lightly cache only what the individual client is accessing. They have no idea what other clients are doing beyond the lock status of one or more rows or tables. This means JET can support far fewer users and is governed by the speed and capacity of the Net and available CPU cycles on the client system.

Sure, if you're building a single-user application, SQL Server (even MSDE or SQL Server Express) can seem like overkill, and JET or another lightweight DBMS engine can be very appealing. However, how many times have you been asked to convert a simple single-user application to multi-user? I've heard of this happening quite often as a small-office application gains popularity and others want to share the data and functionality. I've learned that if an application is not designed to be multi-user from its inception, it's not likely to succeed as such after conversion from single-user.

But what about SQL Server Everywhere (SQLEv)? This new DBMS has the potential to replace many JET databases where a simple, fast, and light data store is called for. In contrast to JET, SQLEv is fully encryptable and can be an ideal local repository for client data. I discuss this in detail in my EBooks on SQLEv.

Optimizing Developer Performance

As you design your application, a more compelling scenario needs to be considered: developer performance. This is especially important for development and IT managers. Developer performance is measured by how many solutions can be designed, developed, tested, deployed, and supported in a given length of time with a given set of developers. It's not how many lines of code are developed in a given length of time; as developers, we can easily "bulk up" our programs if we're being paid by the line. When you build any application, you and your team learn, tune, and leverage specific skills. The techniques you apply control the specific features of the DBMS that make your application perform well and integrate well with other applications. The exception handlers you write, the administrative functions you include, and many other aspects of the application are specifically tuned to accommodate that specific DBMS, whether it's JET, SQL Server, Oracle, or 3 x 5 index cards. This makes sense.

However, when your customer asks you to scale the application or simply start from scratch and create a new application that's clearly out of your current DBMS's reach, the skills and techniques you and your team have honed may no longer apply. As a result, you as a developer are not nearly as useful to your customer or employer; your entire career might be placed in jeopardy. If, on the other hand, you learn to use ADO.NET and SQL Server, you can leverage skills, techniques, schemas, procedures, views, and a wealth of other resources for a project of almost any scale. Retooling the design for more users is not that great a stretch, given your new skillset.

A number of respected experts suggest that developers build intermediate "object" layers to which the development team codes[10]. This can eliminate the need for your team to learn ADO.NET (or the data access interface du jour), but it does require more training and discipline to learn how to design, implement, use, and support this custom interface. The Microsoft Data Access Application Block (DAAB) is a variation of this approach. When you decide to switch your backend to another DBMS engine, (in theory) you can do so by replacing the data object layer. Ah, yes, but this is also fraught with complexities and hidden dangers, as there always seem to be unintended side effects when removing one component or layer of components and replacing it with the "equivalent."

[10] See Building Applications and Components with Visual Basic .NET, by Ted Pattison and Joe Hummel (Addison-Wesley Professional, 2003).
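To give the "object layer" idea some shape, here's a minimal sketch of the approach (this is not the DAAB itself, and the interface and class names are my own invention): the team codes against a small interface, and the SQL Server-specific plumbing hides behind it, so swapping the back end means, in theory, swapping only the implementation.

    using System.Data; using System.Data.SqlClient;

    // The contract the rest of the team codes against.
    public interface ICustomerData
    {
        DataTable GetCustomerOrders(int customerId);
    }

    // The SQL Server-specific implementation, replaceable (in theory) without touching callers.
    public class SqlCustomerData : ICustomerData
    {
        private readonly string connString;

        public SqlCustomerData(string connString)
        {
            this.connString = connString;
        }

        public DataTable GetCustomerOrders(int customerId)
        {
            DataTable orders = new DataTable();
            using (SqlDataAdapter da = new SqlDataAdapter(
                "SELECT OrderID, OrderDate, Total FROM dbo.Orders WHERE CustomerID = @CustomerID",
                connString))
            {
                da.SelectCommand.Parameters.AddWithValue("@CustomerID", customerId);
                da.Fill(orders);
            }
            return orders;
        }
    }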

Another issue your developers are going to have to deal with is "builds," service packs, and the intermittent flow of "fixes," patches, and upgrades that pop out of Microsoft and the other vendors on which we all depend. Once your application is deployed, it must (somehow) continue to work even when many of the interfaces it uses are evolving. While the .NET architecture alleviates some of these problems, I've still seen situations where upgrading the existing .NET Framework breaks deployed applications, as when a service pack is applied to the Framework. I've also seen problems purportedly caused by the tighter (somewhat anal) security "improvements" in Windows XP SP2. For the most part, if you're reading this book, you're pretty smart and you should know how to debug and work through most of these upgrade issues. Unfortunately, your users aren't always that informed, so developers have to deal with users who are just doing what they're told to do and upgrading their systems. In an ideal world, an "upgrade" or patch should not break their applications. But life, love, politics, and software development are not that simple. It's our job to make sure that they and their applications don't fail, at least to some extent.

What About Shared Code and Team Development?

When companies develop applications using SQL Server, they often do so in teams, as the projects they build are far too complex to be completed by a single (or even a married) person. Usually, there are too many interdependencies for a single developer to handle by themselves, so other developers help with design, writing specifications, setting up databases and data interfaces, creating user interfaces, writing documentation and test procedures, and doing everything else that goes into a serious project, including making tea. A critical aspect of these more complex designs is the management team that coordinates the activities of the development team. Without disciplined management, these projects invariably degrade into chaos. That's where Team System comes in. This new version of Visual Studio is designed to help teams architect, design, map, and build, as well as archive and protect complex applications under development.

In shops both large and small, developers usually have to work with a specialized, dedicated database administrator (DBA) who's responsible for data and referential integrity and system maintenance, as well as a full plate of other duties. Sure, in lots of shops, the developer is the DBA is the developer. This is where protected objects and shared code play a critical role in the success or failure of the project and in the security of the database. Based on my research, DBAs generally prohibit unrestricted access to base tables in a database but grant rights to execute specific stored procedures or query specific views to retrieve and alter data. As I explain in Chapter 5, "Managing Executables with the Server Explorer," these stored procedures are set up very much like COM objects with (hopefully documented) named interfaces, named parameters (input and output), pre-defined return values, and any number of rowsets, print messages, error messages, and event log entries. Each of these procedures leverages SQL Server's ability to pre-compile a query plan that matches a given set of input parameters to further optimize the procedures' operations. By using stored procedures, a development team can more effectively and efficiently manage a complex application development project.
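Calling one of these gated stored procedures from ADO.NET looks something like the following sketch. The procedure name and parameters are hypothetical; Chapter 5 covers managing the real thing.

    using System; using System.Data; using System.Data.SqlClient;

    static class ProcCaller
    {
        public static int GetCustomerOrders(string connString, int customerId, DataTable orders)
        {
            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand("dbo.usp_GetCustomerOrders", cn))
            {
                cmd.CommandType = CommandType.StoredProcedure;   // a named interface, not ad hoc SQL
                cmd.Parameters.AddWithValue("@CustomerID", customerId);

                SqlParameter retval = cmd.Parameters.Add("@ReturnValue", SqlDbType.Int);
                retval.Direction = ParameterDirection.ReturnValue;

                using (SqlDataAdapter da = new SqlDataAdapter(cmd))
                {
                    da.Fill(orders);                             // the rowset(s) come back here
                }
                return (int)retval.Value;                        // the pre-defined return value
            }
        }
    }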

IMHO

Building designs around stored procedures requires something some shops simply don't have: discipline.


However, building designs around server-side executables requires something that some shops simply don't have: discipline. Once applications are written using one or more stored procedures, functions, or user-defined types, it's essential that no one (not even the DBA) change the code without understanding (and testing) the impact on the applications that use (or will use) the server-side code. The same can be said for any DBMS object: tables, views, columns, user-defined types, rules, or any other database object. This is even more important when datatypes, stored procedures, functions, and aggregates are written in CLR code. If the DBA (or a developer with DBA rights) makes a change in the schema, it can easily impact (irritate, antagonize, madden) others who depend on a specific schema layout for their applications. Yes, it's far harder to develop complex applications in teams, especially if the individual members of the team aren't communicating or are running open-loop with their own agenda. Does this mean holding never-ending meetings to discuss these changes? Sure, if that's what it takes, but hopefully not. Better-disciplined teams use a variety of techniques to communicate changes to other members or simply delegate all changes in the form of documented requests to a DBA. Sure, the vast majority of the sites out there don't have a full-time DBA watching over the systems and developers to make sure they don't do something dumb. What usually happens is that one (or more) developers take on the role of the DBA on an informal basis, or perhaps the task is relegated to an off-site support team. What's tough is when the role of "DBA" includes everyone in the organization from the development team to the janitorial service and everyone in between. This can happen pretty quickly when the SA password is posted on the cafeteria bulletin board or handed around to anyone needing access to the database.

Building Shareable and Reusable Components

Shared code issues are not new. This problem started when companies started trying to build products with replaceable parts; the idea was that one could build an appliance, car, or software program with components built to be used in one or more products. That is, the same part could be used in copies of the same product or in other similar products. For example, Ford could use the same updraft carburetor in its 1930 Model A Tudor or its Model A truck. If the part "wore out," it could be replaced with one just like it, and the product would be back on the road again in no time. This concept revolutionized industry and product manufacturing all over the world. Instead of using skilled craftsmen to build custom (and often complex) products one by one, industries turned to more efficient mass-production assembly lines. While this required more complex designs and parts manufacturing had to be more precise, it made the manufacturing process simply a matter of fitting the parts together on an assembly line.

This concept works fine when it comes to relatively simple components like interchangeable windshield wipers and exhaust mufflers or SCSI/IDE/SATA hard drives. But when the products improve and a new model is shipped, do the product designers include these older parts? Sometimes, but not always; they often improve them (that's the whole idea behind "new and improved"). But do the newer parts work in older models, like a USB 2.0 drive in a USB 1.0 port? Sometimes, but not always, not nearly as frequently, and usually with some degradation in features or performance.

Let's stick to software instead of how well replacement car parts work (or don't). Shared software components are a lot easier to build than a Ford four-cylinder short block. They don't have bearings that wear out (or, at least, none of my applications do), but they do have lots of parts and often many interfaces: the "connectors" used to hook up the components to other components or your application. When these shared or "common" components are upgraded, it's usually (but not always) done without changing their name or the existing input or output parameters, so they expose the same datatypes, names, and values as the older version (this is called the "signature"). They "look" the same as the component they're replacing (at least to the older program). This way, the new component can be installed without having to change the existing application; it simply asks for a component by name and the OS loads it, at least in theory. However, the inner workings of the shared components change even if the external name, signature, or interfaces don't. These changes are either "bug fixes," "enhancements," or simply side effects of other changes. Good or bad, in any case, the component behaves differently, hopefully in positive ways, but any change invariably has unintended side effects. Since developers using a component have to work with its existing bugs and behaviors, when a bug is fixed or a behavior changes, the component can fail to work the way it did before the change. This can break your application, make your documentation incorrect, or simply confuse the user.

Sharing the MDAC Stack

Consider what happened when Microsoft "fixed" ADO 2.1 and introduced 2.5 (you don't want to know what happened to Versions 2.2 through 2.4). In this case, the MSADO15.DLL (used to implement all versions of COM-based ADO) was replaced by a new DLL (with the same name), but the functionality of the new DLL was so different that it broke applications from one end of the planet to the other (except in some parts of Cleveland). The real problem is that developers often have no choice when it comes to updating their applications; the MDAC stack (which contains ADOc) is included in new versions of Windows, Microsoft Office, Visual Studio, and other tools and applications, and is quietly installed without developers (or users) really knowing what's going on behind the scenes. No, it's not possible to use a specific version of COM-based ADO; you must use the version that's installed on the target system. Yes, you can code your application to address a specific type library (typelib) for one version of the COM component or another, but you'll still be executing the same binary. Each version of the MDAC stack included typelibs for older versions. For example, ADO 2.7 includes typelibs for 2.1, 2.5, and 2.6. When you code your application, you point to a specific version: the current DLL or an older version's typelib. This way, your application still sees the signature of the older version, even though it's (always) implemented by the most recent MSADO15.DLL; the new version "fakes" the interfaces exposed by the older version. Sure, your application could install the latest version at deployment time, but if a newer version was already installed, you couldn't (shouldn't) overlay it; your only option is to hope your application developed and tested with the older version still works with the new "improved" version of ADO. Sadly, it sometimes doesn't. This catch-22 is called "DLL Hell." If you've been working with COM-based ADO "classic" (ADOc) for any length of time, you're doubtless familiar with the issues.

The .NET Revolution

The .NET Framework was invented to eliminate (or at least minimize) these very serious DLL reuse and sharing problems. In this case, developers can build an executable and test, debug, and deploy it with a more-or-less guarantee that the code will continue to work even if newer versions of the .NET Framework are installed (again, at least in theory). Yes, this means that your target system might have 2, 3, or 12 versions of the Framework installed, and several might be in memory at once. However, the applications should be able to function independently (assuming there is enough RAM and other resources). Sadly, I have seen some evidence (in the fall of 2004) that SP1 for the 1.1 version of the Framework has broken some existing applications. Given what I've said about "replaceable" common code, I'm not really surprised.

Each deployed .NET application uses its own Bin directory to hold its unique executables. Yes, this area on disk might contain copies of components used in other applications, but updating one application does not affect the others that share the component; each application gets its own copy. It also means that different versions of the same component can (and do) exist in memory at the same time. .NET applications can be built using COM components, so .NET permits you to either reference a shared (common) copy that can be replaced or upgraded, or a private copy that belongs exclusively to your application assembly. As I'll show you later, continuing to use COM components (like ADOc) in your .NET applications won't help performance, but you don't have to worry (as much) about DLL Hell.

What About Multi-User Issues?

When designing your application, you need to think about how many users are going to be accessing the data at any one time. It's relatively easy to write an application that simply shares data with lots of people, as long as no more than one person tries to update the data at the same time. This is as easy as printing a memo and handing it out at a meeting. If you're building a single-user application, the only contention you have to face when working with the database is from your own application. Sure, it's possible (and not that uncommon) to lock yourself out of a database by not thinking about data flow and what's being done behind the scenes to supply requested data. When you execute a query, the database engine finds the data referenced in the SQL and starts to return it to you. Until the process of passing that data to the client is complete (rowset population), the server can hold share locks on one or more rows, pages, extents, or tables, or even the whole database, depending on the scope of the query. If you try to update the data you just (started to) read, you might find that you're blocked until the rowset population is complete. I talk about this again when I discuss the DataReader in Chapter 8 and the Fill method in Chapter 10, "Managing SqlCommand Objects."
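The sketch below shows the habit that keeps you from blocking yourself (or others): open the connection late, read the rowset to completion, and let the using blocks close the reader and connection promptly so any share locks are released. The table and column names are placeholders.

    using System; using System.Collections.Generic; using System.Data; using System.Data.SqlClient;

    static class QuickFetch
    {
        public static List<string> GetProductNames(string connString)
        {
            List<string> names = new List<string>();
            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(
                "SELECT ProductName FROM dbo.Products WHERE Discontinued = 0", cn))
            {
                cn.Open();
                // CloseConnection ties the connection's lifetime to the reader's.
                using (SqlDataReader rdr = cmd.ExecuteReader(CommandBehavior.CloseConnection))
                {
                    while (rdr.Read())
                    {
                        names.Add(rdr.GetString(0));   // populate the rowset completely...
                    }
                }   // ...then the reader (and the connection) close here, releasing any share locks
            }
            return names;
        }
    }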

Once you plan to support two or more users with your application (running multiple instances of the application), the rules of engagement change and your application's code becomes an order of magnitude more complex. In this case, you can't afford to simply read all of the rows from one or more database tables into memory, update as you please, and write back the changes to the database. That's because your changes won't be seen by other users (or even by other parts of your own application) until you post those changes to the database server and the data is requeried. If other users change the data while you're holding it in memory (and possibly changing it), your code has to figure out whose changes are to take precedence. When two or more applications change the same data, it's called a "collision," and it's your job to clean up the mess caused by these data collisions.

If you're using a "disconnected" data architecture as commonly used in ASP.NET (web-based), ADO.NET "disconnected" applications, or "load and go" applications, your changes might not be committed to the server for quite some time. In the meantime, while your user is deciding what to buy, sell, or change, someone else (or a thousand others) could be doing the same thing, fetching some, all, or none of the same data and making changes. If everyone has rights to change any and all data, what happens when someone tries to update the data you fetched? ADO.NET does not have any mechanism to automatically decide what to do or prevent changes to the data still in the database[11]; it's up to you to tell ADO.NET what to do if there is a collision (when an update is attempted and another change was made to the data since you last read it). I discuss techniques to avoid collisions and to deal with them once they occur in Chapter 12, "Managing Updates."

[11] Yes, it's possible to set up a "pessimistic" lock to prevent access to rows, pages, or tables, but that's not an option for most architectures. I'll talk about locking in Chapter 8, "Getting Started with ADO.NET," and Chapter 12, "Managing Updates."
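One common way to tell your code what a collision looks like is to include the row's original values (or a rowversion/timestamp column) in the UPDATE's WHERE clause and then check how many rows were actually changed. The following is a sketch under the assumption that the table carries a rowversion column named RowVer; Chapter 12 covers the full story.

    using System; using System.Data; using System.Data.SqlClient;

    static class OptimisticUpdate
    {
        // Returns true if the update succeeded; false if someone else changed the row first.
        public static bool UpdatePhone(string connString, int customerId,
                                       string newPhone, byte[] originalRowVer)
        {
            const string sql =
                "UPDATE dbo.Customers SET Phone = @Phone " +
                "WHERE CustomerID = @CustomerID AND RowVer = @OriginalRowVer";

            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(sql, cn))
            {
                cmd.Parameters.AddWithValue("@Phone", newPhone);
                cmd.Parameters.AddWithValue("@CustomerID", customerId);
                cmd.Parameters.AddWithValue("@OriginalRowVer", originalRowVer);
                cn.Open();
                int rowsAffected = cmd.ExecuteNonQuery();
                return rowsAffected == 1;   // 0 rows means the row changed (or vanished): a collision
            }
        }
    }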

ADO.NET 2.0 does have the ability to work with SqlDependency objects that can be tied to a SqlCommand. Once properly configured, your application can receive an event if selected data on the server is changed. This is great if you've populated a pick list from a list of valid widgets. If some operation on the database changes members of your dependent Command object's rowset, an event fires to notify you of the change. I plan to write another EBook on this subject.
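Here's a rough sketch of the SqlDependency plumbing. It assumes Service Broker is enabled on the database and that the query follows the notification rules (an explicit column list and two-part table names); the table and column names are hypothetical.

    using System; using System.Data; using System.Data.SqlClient;

    class PickListWatcher
    {
        public static void WatchWidgets(string connString)
        {
            SqlDependency.Start(connString);   // once per AppDomain per connection string

            using (SqlConnection cn = new SqlConnection(connString))
            using (SqlCommand cmd = new SqlCommand(
                "SELECT WidgetID, WidgetName FROM dbo.Widgets WHERE Active = 1", cn))
            {
                SqlDependency dependency = new SqlDependency(cmd);
                dependency.OnChange += OnWidgetsChanged;

                cn.Open();
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        // ...populate the pick list...
                    }
                }
            }
        }

        static void OnWidgetsChanged(object sender, SqlNotificationEventArgs e)
        {
            // Fires (once) when the rowset on the server changes; requery and re-subscribe here.
            Console.WriteLine("Pick list data changed: " + e.Info);
        }
    }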

What About Scalability?

When you build an application that's designed to support more than one user, you're well along the way toward building a scalable application. Scalability simply means the application design automatically (or gracefully) supports a growing number of users or operations in a given length of time. Not all systems need to be scalable beyond specific limits. Often there is no reason to go through the (rather expensive) steps of building in scalability if your application does not need to support more than a dozen to a few hundred users. Sure, virtually all applications can scale, at least to some extent. It's how well they scale that makes the difference. A well-designed application should scale linearly. That is, response times should remain about the same as more users are added, not degrade exponentially. As more users are added, a scalable system should be able to accommodate additional load, just as a properly matched engine, transmission, and drive train can handle pulling that 3.5-ton boat and trailer behind your SUV (assuming you bought the towing package).

Highly scalable designs should lend themselves to scaling "out" as well as "up." That is, applications that can scale out must be able to support large numbers of users and are coded so as to spread the load over several servers (clustered servers or server farms). Are scalable applications faster? Frankly, not always. I've seen that the techniques used to make an application scale up and out usually add to system overhead, so they don't perform as well as single- or few-user applications. However, these techniques are often a prerequisite to the ability to scale. In other words, the customer interface is not particularly snappy (as in a single-user application), but the system can support tens to hundreds of thousands of users with "reasonable" response times.

In the past when discussing scalability and replication, people usually meant designing systems that used several servers. Now, with the advent of mobile devices capable of hosting SQL Server Everywhere (SQL Mobile), scalability through replication can involve many hundreds, if not thousands, of "clients" all trying to stay in sync. It's clear that future scalable designs need to consider mobile devices such as smart phones, PDAs, television set-top boxes, and home appliances. Sure, before long, many of us will be writing programs to keep your refrigerator in sync with your PDA, so you won't forget that you're out of low-fat milk when you go to the store.

What prevents an application from scaling? Generally, it's because it does not know how to "share." Consider that an application must run in an operating system that's used to support other applications. If the server is dedicated to SQL Server, all the better; all the application has to do is consider other SQL Server users and their demands for resources. If your application demands server-side resources such as server-side cursors and temporary tables, consider that, for each connection made to the server (one or more for each user), any demand on server-side resources is multiplied. Sure, SQL Server can support lots of activity on TempDb (where cursors, intermediate sorts, and temp tables are created), and it knows how to efficiently manage memory to cache the most frequently used data pages. But if you expect to create a scalable application, you'll learn to conserve these resources and offload some of the work to the client system. For example, you might be able to reduce your application's resource footprint by deferring a sort to the client application or not using a server-side cursor (which is tough to do in ADO.NET anyway) or not building large temporary tables. However, by efficiently using server-side state, you might also reduce the amount of work done by the server each time your application needs to execute a query or perform an update. The trick is to learn how to balance these resources to maximize efficiency.
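For instance, instead of asking the server to re-sort and re-filter the same rowset over and over (or park it in a server-side cursor or temp table), you can fetch the rows once and let the client do that work. A minimal sketch, assuming a DataTable you've already filled:

    using System.Data;

    static class ClientSideWork
    {
        // 'orders' was filled once by a DataAdapter; sorting and filtering now cost the
        // client a little CPU and RAM instead of costing the server TempDb space and locks.
        public static DataView SortAndFilterLocally(DataTable orders)
        {
            DataView view = new DataView(orders);
            view.Sort = "OrderDate DESC";
            view.RowFilter = "ShipCountry = 'USA'";
            return view;
        }
    }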

For ASP.NET applications, scalability (up and out) is essential. In this case, the application your customer executes by navigating to an .ASPX page runs for a very short length of time, but a significant amount of time is spent rehydrating the application session state; that is, resetting the application variables to the values they held when the application was last accessed by this user. This can include restoring a saved data state as well, or re-running a query to refresh the data state. Restoring this state can be expensive, so it's essential that as little "state" as possible be saved.
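In practice, that means keeping the per-user state down to a few keys and requerying (or caching) the rest. A hedged ASP.NET sketch follows; the page, session key, and helper method names are hypothetical.

    using System; using System.Data; using System.Web.UI;

    public partial class OrderPage : Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            if (!IsPostBack)
            {
                // Save only the key between round trips, not the whole rowset.
                Session["CustomerID"] = 42;                     // small and cheap to rehydrate
            }
            else
            {
                int customerId = (int)Session["CustomerID"];
                DataTable orders = LoadOrders(customerId);      // requery rather than persist the data
                // ...bind 'orders' to a grid...
            }
        }

        private DataTable LoadOrders(int customerId)
        {
            // Hypothetical data-access call (see the stored-procedure sketch earlier in the chapter).
            return new DataTable();
        }
    }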

What Are the Limits?

The most efficient application in the world has its limitations; just try to log on to one of the web sites mentioned in a Super Bowl ad during the game. The best applications know how to deal with excess volume as gracefully as they deal with normal demand levels. It's very educational to see how some of these web sites fail when stressed by tens of thousands of bored fans. We've all seen folks build systems that demand too much of their hardware. Consider that SQL Server, Oracle, and DB2 are world-class database management systems, but they all have limitations. Even when given one or more high-performance hyper-threading processors to execute the code, high-performance hard drives to fetch and store the data, gigabytes of fast RAM to cache the data, high-speed LANs to return the results, and efficient code to execute, any DBMS still has a top-end transactions-per-second limit. When the code is less than optimal, the disks are slower or fragmented, the CPU is clogged up with unrelated overhead, you don't have much RAM, or the LAN is busy streaming videos, your overall system performance can be degraded. When you use a client computer to host SQL Server (as with MSDE or SQL Server Express installations), you're often faced with a system that's short on everything hardware-wise, not to mention the fact that the user might be running Doom in the background. It does not make much sense to test and tune your new application on a development system that can radically out-perform the target user's system. Users rarely like to hear, "Well, it works just fine on my system...." It's our responsibility as developers to write code that's efficient enough to run on the available iron. Blaming hardware is like a farmer blaming the plow for the stones in his field.

IMHO

It's our responsibility as developers to write code that's efficient enough to run on the available iron. Blaming hardware is like a farmer blaming the plow for the stones in his field.


Does this mean you need to plunk down $9,000 for a new system to run SQL Server for a few dozen users? Hardly. Consider that in the late 1980s, Microsoft IT ran a 386-33 server to support thousands of campus client/server users, and they were running the OS/2 version of SQL Server. SQL Server is capable of handling a tremendous volume of connections and work. Over the years, it's been tuned and refined to work even more closely with the operating system, so it's one of the most (if not the most) efficient DBMS platforms on the market. This means to support SQL Server in a small to medium-size organization or a moderately active web site, you might not need anything more than a bare-bones cast-off system that's capable of running Windows NT, Windows 2000, or Windows Server 2003 (depending on your security demands). I like to use my second- or third-oldest systems to run SQL Server for my sites. I suggest the same for some of our customers who need to support a few dozen to a few hundred users. It often means the customer can use a $400 system to do the job. If the system gets overloaded as the company gets more successful, it's a snap to upgrade to faster hardware. Sure, it means that our applications have to be especially efficient and not squander the somewhat limited hardware resources. If you write sloppy code and ignore our suggestions, you might not get away with a cheap server system.

IMHO

If you write sloppy code and ignore these suggestions, you might not get away with a cheap server system.


What Can Get in the Way on the Client?

When applications and databases are deployed, it's important to know what other applications your client application (which might be running on an IIS server, a hand-held PC, or a Windows 98 system) and your SQL Server are competing with. I've seen very few companies that give their employees top-end Pentium systems with lots of RAM as their client workstations. While your customer might be an exception, more typically these systems are older PII or PIII systems with very little RAM and tiny (by today's standards) 5,400 RPM drives, and that's in Trump Tower. In many companies, I've seen far less capable systems. Companies simply can't afford to constantly upgrade their systems to accommodate inefficient applications or more feature-rich operating systems. This means it's our responsibility to write applications that are efficient enough to work on the client systems, as is.

All things should be taken in balance, though. Peter developed a database system for use in local government where the indigenous hardware was adequately specified for running a copy of Microsoft Excel and Word concurrently without too much thrashing. However, once he installed the application, things went south when the customer wanted to run Excel, Word, and his application at the same time. He knew that he would hit that problem, as he was aware of the customer's hardware configuration. Given the situation and politics, it was cheaper for him as a software supplier to factor in the cost of upgrading the system's RAM as part of the software cost rather than trying to engineer the application to use (far) less memory.


Consider that your application has to share hardware resources like CPU time, RAM, screen real-estate, and disk space on your customer's system with other applications, including RAM-hungry word processors, ancillary background tasks, anti-virus tools, SQL Server instances, or even games. Sometimes, upgrading the hardware as a "software upgrade/fix" satisfies the anal bean counters, and your customer.

Unless they are dedicated application workstations, client/server Windows systems are typically burdened with other applications like Outlook, Word, Excel, a browser, and a multimedia player or a game or two, all running at the same time. Sure, your application can compete with these applications if it's given enough resources (or does not demand too many resources) to stay in memory. However, we've all seen the new pariah: ad-ware, spy-ware, and viruses that can suck CPU and other resources out of your client system like kudzu sucks the life out of a tree. If your shop is highly disciplined (and I've seen some), it does not permit the users to run anything on their work PCs that has not been cleared and validated by the company. When the applications running on a Windows system demand more memory than the system has available, Windows chooses one or more blocks to swap to disk. This can dramatically slow down the system. A faster hard disk can help, but not as much as more memory. If SQL Server is loaded on the workstation, it knows how to release memory on demand, but you can configure it to use less memory in the first place.

IMHO

There's nothing that slows down SQL Server like a busy print or report server, except perhaps a Castle Wolfenstein screen saver.


What Can Get in the Way on the Server?

With server-based executables like IIS, ASP, and SQL Server itself, you need to be just as vigilant to avoid running jobs or extra services that compete with the system. Sure, this means it can be a lot more efficient to run SQL Server, Reporting Services, and IIS on separate dedicated servers, if your customer can afford it (few can). Consider that SQL Server Express Edition with Advanced Services can host Reporting Services (on local data) along with Full-Text Search, so price might not be a factor. It also makes sense to put your print spoolers and Exchange, ISA, and file servers on separate systems. When you don't need expensive licenses, split off server functionality to separate servers. Consider that it might be your own applications slowing down the server. As we've discussed in this chapter and I'll repeat throughout the book, a single poorly written application can bring a system to its knees. Sure, your application might not be at fault, but it could be blocking (or consuming) resources needed by another application that's chewing up CPU and disk resources like my granddaughter chomps on blackberries.

There's nothing that slows down SQL Server like a busy print server or Report Server, except for a screen saver. These both use CPU-intensive rendering engines that can exhaust a system's resources in no time. I got a complaint from a user that his client applications were running very slowly, but only sometimes. When he went to the server to see if the performance monitor was showing any bottlenecks, the server came back to life and began to run normally. A few minutes later (always about 10 minutes), the server slowed down again. I suggested that they turn off the 3D "Castle Wolfenstein" screen saver that was sucking every bit of processing power from the server.

What About Maintenance and Administration?

Every database application system needs to consider DBMS maintenance and administration as part of its responsibilities. We can't assume the user will periodically back up the database and know how to restore it when things don't go as planned (as they always do). We also can't assume the user will know how to set up new users or assign appropriate security rights. Sure, this responsibility can be delegated to a DBA if they are aware of these requirements. But if you're building an application that does not have the luxury of even a part-time DBA, you need to build in the appropriate maintenance tasks.

When working with JET databases, these tasks are of critical importance, as JET requires periodic compaction and repair. Since JET does not reclaim space lost when rows are updated or deleted, the database file can keep growing until the disk runs out of space, and your customer will be helpless to address the problem. Backup with JET is also problematic. Since JET won't let the backup (or compaction) utilities run while any user has the database file (or lock file) open, you'll have to make sure all users are logged off before continuing. This can be problematic if Fred has left his system running and goes on an extended vacation.

Managing Maintenance Tasks

SQL Server is a lot easier to maintain, but the basic administrative tasks still need to be done. As I'll discuss in Chapter 2, SQL Server logs each change to the database to the transaction log. The log grows automatically (as does the database), but there's no need to compact SQL Server databases, as space is reclaimed as rows are deleted and updated. However, you'll want to build routines to automatically back up and flush the transaction logs and periodically back up the database when building any application that has no dedicated DBA. These maintenance tasks can be done online, as SQL Server does not require that you log off users while these tasks run. All of the maintenance tasks can be done via T-SQL, but they can also be done via SQL Server Management Objects (SMO) in SQL Server 2005, which exposes a programmatic interface of .NET managed objects, properties, and methods to back up and restore the server. SMO is a completely new object model built from the ground up that is intended to replace SQL-DMO, although DMO is included in SQL Server 2005 for backward compatibility only. Your application might also need to perform some (or a great deal) of login account setup and management, depending on the security approach you use. Note that expensive user rights management schemes might not be necessary when implementing SQL Express.
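As a sketch of those "build routines to back up automatically" tasks, the T-SQL BACKUP commands can be issued straight from your application via ADO.NET (SMO offers a richer object model, but plain T-SQL works fine). The database name and file paths here are placeholders, and the log backup assumes the database uses the full recovery model.

    using System; using System.Data.SqlClient;

    static class BackupJob
    {
        public static void BackupDatabaseAndLog(string connString)
        {
            using (SqlConnection cn = new SqlConnection(connString))
            {
                cn.Open();

                // Full database backup.
                using (SqlCommand cmd = new SqlCommand(
                    @"BACKUP DATABASE MyShop TO DISK = N'C:\Backups\MyShop_full.bak' WITH INIT", cn))
                {
                    cmd.CommandTimeout = 0;   // backups can run longer than the 30-second default
                    cmd.ExecuteNonQuery();
                }

                // Transaction log backup (flushes the log so it doesn't grow without bound).
                using (SqlCommand cmd = new SqlCommand(
                    @"BACKUP LOG MyShop TO DISK = N'C:\Backups\MyShop_log.trn'", cn))
                {
                    cmd.CommandTimeout = 0;
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }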

IMHO

While periodic backups are vitally important to your customer, they pale in comparison to the importance of "restores."


The Importance of Restoring Data

While periodic transaction log and database backups are vitally important to your customer, they pale in comparison to the importance of "restores." The most comprehensive backup in the world is useless if your application has not figured out the complexities of a system, database, and transaction log restore. All too often, I've seen customers lose hours, days, and their very existence because they were unable to restore their databases. I recommend that you start with a virgin computer and document every aspect of the process needed to bring that system back to a working state. This documentation must be stored both onsite, where it can be found quickly, and offsite (with the data), where it can be found by those who survive a more serious disaster. Don't discount the difficulty of repeating the process of resurrecting a system. Be sure to consider what you'll have to do if the Internet and online sources for software updates are not available. I like to build "ghost" images of the base system to expedite the recovery process, but it's really beyond the scope of this book to get into the nitty-gritty details of this process. Just be sure you and your application are ready. Your customer will thank you for it.

Keep in mind that backing up or even imaging a system drive won't do you any good if the replacement system has a different motherboard. I've had considerable difficulty when replacing dead (or dying) motherboards with an upgraded version. It seems that the installation routines that make low-level hardware configurations are executed only during initial Windows setup, and those cast-in-stone configurations prevent the OS from running on different hardware. This means if your disaster plan calls for system replacement, it had better be with identical systems, perhaps ones that are many years old, if you expect a simple image restore to work.


Dealing with Backup and Deployment Issues

It's essential that you take into account the foibles of the hardware used to store your user's (and your) data. Hard disks (like any mechanical device) have a limited life. While some drives last longer than others, typically you should expect 3 to 5 years from a $100 drive. Electronic memory that retains its state post power-down (flash and other technologies) is also subject to failure. Power supplies, fans, every moving part, and every exposed port in the system are all subject to failure due to mechanical wear, dust infiltration, or voltage variations. Every type of storage is subject to failure of some kind, even clay tablets or concrete tombstones. This means you need to back up your data and be ready for the hour when the components fail. More important, you need to be able to restore your system quickly, quietly, and efficiently, without your customer noticing. Frankly, the latter is harder. It's easy to execute the backup routines in any DBMS, and SQL Server's are especially easy to use. T-SQL backups can even be done from your application. A new feature in SQL Server 2005 (SP1) implements fail-over clustering and database mirroring. I wish I had more bandwidth to explore that with you, but again, this is targeted for another EBook.

Restoring your system to its previous state: now that's the problem. When working with SQL Server (as I discuss in the next chapter), you should take a periodic log dump to save the transaction logs to removable media. The base database backup plus the logs equals a restored database. If you're running more than one database on a server, you'll have N database backups and one set of logs: one log backup for each week/day/hour you back up (and dump) the logs. Having backed up the database and dumped the logs, you still can't restore if you don't have a base system to restore to. In my experience, when it comes time to restore, it's more likely that the entire system is pooched, not just a single database. This means you might need to build a new hardware system up from a system image (using Ghost or the equivalent), or by hand, one step at a time. The manual process can take quite some time (days, in some cases) of configuring and loading, updating, posting web updates, installing service packs, and an endless list of configuration settings. That's why I encourage you to make a system image of the completed system once it's installed and ready to accept the database restore. Once you're ready, you can restore the base database backup and then restore the log files one at a time. This should return your database state to the point of the last transaction log dump/backup. That's why, for serious production systems, the DBA arranges to take relatively frequent transaction log backups and transports them off-site immediately to protect the database in case of a catastrophic problem.
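The restore side, in T-SQL terms, looks roughly like this sketch: restore the base backup WITH NORECOVERY, apply each log backup in order, and only bring the database online with the last one. The database name and file paths are placeholders, and you would run this on a connection to master, not to the database being restored.

    using System; using System.Data.SqlClient;

    static class RestoreJob
    {
        public static void RestoreToLastLogBackup(string masterConnString)
        {
            const string restoreBatch = @"
                RESTORE DATABASE MyShop
                    FROM DISK = N'C:\Backups\MyShop_full.bak'
                    WITH NORECOVERY, REPLACE;
                RESTORE LOG MyShop
                    FROM DISK = N'C:\Backups\MyShop_log1.trn' WITH NORECOVERY;
                RESTORE LOG MyShop
                    FROM DISK = N'C:\Backups\MyShop_log2.trn' WITH RECOVERY;  -- now it's usable";

            using (SqlConnection cn = new SqlConnection(masterConnString))
            using (SqlCommand cmd = new SqlCommand(restoreBatch, cn))
            {
                cmd.CommandTimeout = 0;   // restores take as long as they take
                cn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }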

IMHO

Restoring a database can be an order of magnitude more difficult than backing it up.


Dealing with Security Issues

Once you code your own data access routines, you'll need to set up your own security mechanisms if the data is even slightly sensitive. Sure, if the program is simply inventorying the contents of the user's refrigerator, you (and the user) might not care if someone broke into the database to find out that they are low on cottage cheese. You might be a bit more security conscious if you thought your mom was hacking into your refrigerator system to check on your stock of cold beer. However, because your code implements custom data I/O routines, you can easily incorporate sophisticated encryption techniques in the data you store. This will make it very tough to break into the data and uncover sensitive information, even with full access to the hardware. Security mechanisms include ways to gate access to the server and the client application itself. They can be as simple as setting up logins in the Windows domain or creating users, groups, and specific permissions on database objects. While I discuss the litany of security issues throughout the book, I want to make one point clear: Do not develop or deploy an application that connects to the server with the SA account and password. If you ignore this advice, you'll not only compromise your data security, but you'll also mask a number of issues that will crop up if you ever decide to tighten security later. Of course, there is an exception to every rule. When you deploy a SQL Server Express database with a User Instance, it is accessed with SA-equivalent rights because the user who attaches it is granted those rights. I discuss this in detail in Chapter 9, "Getting Connected."
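To put the "don't connect as SA" advice in concrete terms, here are two hedged examples of connection strings that use least-privilege credentials instead. The server, database, and login names are placeholders.

    using System.Data.SqlClient;

    static class Connections
    {
        // Preferred: Windows (trusted) authentication; the app runs under a low-privilege account.
        public static SqlConnection GetTrustedConnection()
        {
            return new SqlConnection(
                @"Data Source=MyServer;Initial Catalog=MyShop;Integrated Security=SSPI");
        }

        // If you must use SQL Server logins, use a restricted login created for the
        // application (never SA) and grant it only the rights it actually needs.
        public static SqlConnection GetSqlLoginConnection(string password)
        {
            return new SqlConnection(
                "Data Source=MyServer;Initial Catalog=MyShop;User ID=MyShopApp;Password=" + password);
        }
    }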

Security is not something you tack on to an application or a server like a final coat of paint on that Mustang you're restoring. Security systems that truly protect your data and your customer's data (and your job) are sewn into the fabric of your design from its initial conception to its deployment and implementation. These systems can leverage the Windows domain to validate users, but this is not a panacea. I'll show how it's possible to hijack the user's credentials to execute malicious code or SQL operations. This means that any program that solely depends on "trusted" (SSPI/Windows domain) authentication can be used to execute a query that erases the database or, worse, exposes its data to those who should not have rights to view or change it. No, I'm not going to create a Hacker's Guide to SQL Server, but I know that if you're familiar with the issues, you'll take appropriate steps to minimize the exposed surface area.

The way I see it, there are at least three security personas:

  • The SA or administrator, who has super-rights to the database, schema, infrastructure, and data

  • The developer, who needs rights to the schema but not necessarily the data

  • The user, who needs to be able to view and change data, but just that data to which they are entitled

To implement a secure system, you (or your DBA) need to limit unrestricted access to the data and the schema, as well as the security systems themselves. The first line of defense is the server itself. This means that you (above all) need to protect the SQL System Administrator (SA) account. Sure, this means you'll probably have to find a dedicated DBA and ensure that this person does not release the SA password to anyone. In some cases, you won't even know the SA password. With the default configuration (which disables mixed-mode SQL Server authentication), any Windows login that's a member of the local Administrators group is automatically granted SA rights. This means that there might be any number of "SA"s in your organization, each a potential security leak.
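A quick way to find out just how many "SA"s you really have is to list the members of the sysadmin server role and, if your security policy allows it, drop the BUILTIN\Administrators group from it. This is only a sketch; don't remove that group until you've set up another sysadmin-level login for the real DBA.

    -- Who currently holds sysadmin (SA-level) rights?
    EXEC sp_helpsrvrolemember 'sysadmin';

    -- Optionally remove the local Administrators group from sysadmin
    EXEC sp_dropsrvrolemember 'BUILTIN\Administrators', 'sysadmin';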

Each object in the database and the database itself needs to be protected. This means every database, table, view, procedure, and everything in between, including data, schema, and the permissions assigned to protect them.

To make this process more difficult than it has to be, Visual Studio 2005 does not include any of the tools you'll need to set up or manage user accounts or permissions. I have been asking for these for at least four versions of Visual Studio, to no avail. This means you'll still have to use the new SQL Server 2005 Management Studio to create new users, remove existing users, set up groups and roles, and assign permissions to these users. Keep in mind that while Windows logins imply that a "person" is logging on to the server, I think you should set up "roles" instead. That is, instead of assigning permissions to "Fred", you would assign permissions to "AdminClerk1". This way, when Fred is promoted, his replacement (Sally) won't have to log in with Fred's credentials. Actually, when I create applications, most of them don't require the user to provide credentials; the application provides these behind the scenes. In this case, it's important to carefully gate access to the front-end application or web page.
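Here's a rough sketch of that role-based approach in TSQL. The role name, database, table, and Windows login are all hypothetical; permissions are granted to the role, and people are simply added to (or removed from) it as they come and go.

    USE Biblio;

    -- Create the role once and grant it the permissions the job requires
    CREATE ROLE AdminClerk1;
    GRANT SELECT, UPDATE ON dbo.Orders TO AdminClerk1;

    -- Map Fred's Windows login into the database and add him to the role
    -- (assumes a server-level login already exists, e.g.
    --  CREATE LOGIN [MYDOMAIN\Fred] FROM WINDOWS;)
    CREATE USER [MYDOMAIN\Fred] FOR LOGIN [MYDOMAIN\Fred];
    EXEC sp_addrolemember 'AdminClerk1', 'MYDOMAIN\Fred';

    -- When Sally replaces Fred, swap the role membership, not the permissions
    EXEC sp_droprolemember 'AdminClerk1', 'MYDOMAIN\Fred';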

Okay, so if all of the objects in the database are protected, how is your application going to extract data? Clearly, you're going to have to open some controlled pathways to the data. The most frequently used approach is to write stored procedures that fetch and update the data needed by your applications and grant limited access to those procedures. This means that logins (other than the SA) can access the data only through these carefully scripted procedures. In any case, applications (even administrative tools and the Visual Studio Server Explorer) won't be able to gain access to the data except through these stored procedures.
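In TSQL, that gating boils down to granting EXECUTE on the procedures while denying (or simply never granting) direct access to the underlying tables. A minimal sketch, reusing the hypothetical BiblioApp login and Orders table from the earlier sketches:

    -- A narrowly scoped procedure that returns just the rows the caller needs
    CREATE PROCEDURE dbo.GetOrdersByCustomer
       @CustomerID int
    AS
       SELECT OrderID, OrderDate, Total
         FROM dbo.Orders
        WHERE CustomerID = @CustomerID;
    GO

    -- The application login can run the procedure but cannot touch the table
    GRANT EXECUTE ON dbo.GetOrdersByCustomer TO BiblioApp;
    DENY SELECT, INSERT, UPDATE, DELETE ON dbo.Orders TO BiblioApp;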

This approach is fine for end-user executables, but what about your development team? In this case, you're going to have to work out an approach that exposes the schema but not (necessarily) the data. I often like to use a parallel database that contains data distributed the same way as the live data but that does not contain sensitive information. It can't just be a million identical rows or even randomly generated data, because the query plans and optimizations you develop against it won't match what the live data produces. By using a parallel database, you don't risk compromising the live (secure) data, even though it's a bit more trouble to set up. Developers need to be able to see the schema but not necessarily see (or modify) the live data. They might need to create new tables, views, rules, triggers, and procedures, but these tasks can be restricted to the DBA, and they are in many larger sites.
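One way to build that parallel database is to copy the production rows but scrub the sensitive columns on the way in, so the row counts and value distributions stay roughly realistic. The table and column names below are hypothetical, and a real scrubbing script would have to cover every sensitive column in every table.

    -- Copy production Customers into the development database, masking PII
    SELECT CustomerID,
           'Customer ' + CAST(CustomerID AS varchar(10)) AS CompanyName,
           'nobody@example.com' AS Email,
           Region,
           CreditLimit
      INTO DevBiblio.dbo.Customers
      FROM Biblio.dbo.Customers;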

Understanding Development Costs

As a development manager, I'm always looking at the cost to develop and support applications, as well as how well they work and how quickly they meet customer needs (and get them out of our hair so we can devote our time to the next project or a new customer). I've found that efficient coding yields more satisfied customers more quickly, because the application is more stable and better able to solve the customer's problems. This often means that I don't universally encourage use of home-grown data access strategies, especially if there is even a remote chance that I'll have to upgrade the application to use shared data anytime in the future. You'll need to balance this issue as well.

Many shops seem to prefer to roll their own code instead of using third-party controls or components (including DBMS engines), even though outside solutions might handle 80 to 99% of what the project requires. This is called the "NIH" (Not Invented Here) syndrome, which has cost our industry billions as companies reinvent the wheel for their internal projects. And it's not just components from third (outside) parties that are shunned; sometimes these are components written within their own company but by another department. I can't count the number of times I've heard that an outside vendor's product (while good) could not possibly be as good as "something I build in-house from scratch." It turns out that this is rarely the case, but the revelation occurs only after many months or years of lost resources have been flushed down the drain, along with a number of people's jobs.

IMHO

The "NIH" syndrome has cost our industry billions as companies reinvent the wheel.


Managing Resources

Every application needs hardware resources like RAM, CPU time, disk space, I/O channel time, and network bandwidth. The best applications manage these resources like precious commodities so that there is enough to go around and performance improves. In the olden days, I worked with mainframe computers with 256KB of RAM and a big 40MB hard drive. CPU time was measured by the second, and each CPU second had to be accounted for, as it was extremely expensive. These systems were capable of handling hundreds to thousands of users using architectures not unlike those in use today in IIS and ASP.NET applications.

Sure, these hardware resources are cheaper today than ever, and the price is still dropping. However, this does not mean you can afford to waste them. A good developer knows when to trade off using a shared resource against a local resource. For example, should you re-execute a query to re-sort a rowset on the server, or simply re-sort it in memory at the cost of local CPU time and RAM?

The network is another generally misunderstood and squandered resource. Most of us run Ethernet networks that can tolerate quite a bit of traffic, but when they reach capacity, they tend to collapse quite quickly. Our applications that connect to and through corporate networks have to compete for bandwidth with applications that are streaming videos, browsing the web, running NetMeeting or terminal server sessions, and playing a shared game of Age of Empires. If you add to that the IM and email applications that constantly poll and pump for new messages, you get a pretty good idea why your download is taking so long to run. If the corporate backbone supports only 10Mbps Ethernet or a slower wireless network, you can count on applications timing out from time to time (especially around the holidays, when lots of people are browsing Froogle for their spouse's new toy).

RAM is another precious commodity that applications have to share with Windows and all of the running applications, as well as the printer and other device drivers chewing up memory, even though you haven't printed, scanned, copied, or faxed in a week[12]. Add to that the dozens of services and extra little "helper" executables, and you no longer wonder why it takes so long to boot. Of course, if you spend any time at all on the web, your system will soon be overrun with a dozen adware, spyware, and remoraware applications reporting on your every click to those web sites in China and chewing up memory and CPU time. It's important to make sure your application knows how to tolerate a low-RAM situation and how to free resources when asked. Nope, you won't get an event that says "How about releasing some RAM?" unless you're really clever; those details are handled behind the scenes for you by .NET (at least, to some extent). The best idea is to consider how much RAM your query is going to consume before you execute it and (even more so) when you create a data structure that has to be persisted in memory for any time at all.

[12] My development system had (at one point in time) seven HP printer DLLs loaded that consumed over 32MB of RAM, in idle mode. I've since switched to a Canon printer that has no memory-resident footprint that I can detect.

In Chapter 9, I'll drone on and on about connections: the mechanism that links your application's data access routines to the backend DBMS. While this is a somewhat complex subject, you'll discover that connections are another precious resource, expensive to construct and limited in number by license (in some cases). I'll talk about strategies that open connections just in time and close them immediately when the query finishes, and other strategies that open connections and leave them open. Both of these approaches make sense in the right context. I expect you'll write quite a bit of code to manage your connections and recover from exceptions triggered when the connection is closed unexpectedly or there are no connections available (because you used them up).



