Chapter 6: Microsoft SQL Server




Without exception, every time we build a .NET solution, SQL Server figures somewhere in the architecture, because somewhere along the line there is a requirement to persist, or permanently store, data in a structured format. This can range from state and session information to product information from Microsoft Commerce Server 2002.

Many of the Microsoft .NET Enterprise Server products use SQL Server by default. In some cases you could switch to Oracle or another database, but that would generally be more trouble than it's worth. Because of this, Microsoft SQL Server is becoming increasingly prevalent throughout many organizations.

6.1 SQL Server Architecture

Microsoft SQL Server comprises a number of areas of functionality, including the following:

  • Relational Database - This is the SQL Server database engine and is where all of the data is stored in table-based structures. Data access is controlled by the relational database engine, and the clever use of indexes and data structures in this area increases the speed at which data can be retrieved (see Figure 6.1).

    Figure 6.1: Typical view of data stored in SQL Server.

  • Administration - Managing and controlling the database is done via Enterprise Manager, which runs as a snap-in to the Microsoft Management Console. Microsoft has done a fair bit of work to make the database as easy as possible to manage, and every task, such as backups, table creation, replication, and disk management, can now be completed from this interface. In the past, database administrators (DBAs) would have used the command-line interface and laboriously typed in commands on the screen (see Figure 6.2).

    Figure 6.2: SQL Server Enterprise Manager.

  • Replication - Replication comes as standard in SQL Server and enables you to transfer data from one server to another on an automated basis. This is useful if you have a number of geographically dispersed offices that need, for example, to receive product updates on a regular basis.

  • Data Warehousing and Online Analytical Processing (OLAP) - A data warehouse is used to store data in a read-only format that can be examined, or mined, by business experts to determine trends or patterns in the data set. OLAP structures the data into multidimensional cubes, which can then be explored by analysis tools to understand, for example, who is buying which products in which countries.

Figure 6.3 illustrates the major elements of SQL Server.

Figure 6.3: Major elements of SQL Server.

6.1.1 How SQL Server Stores Data

In the bad old days, SQL Server stored its physical data and transaction logs in device files, with segments used to span a database across multiple physical disks - an arrangement that was extremely complicated. Devices and segments were a hangover from the Sybase UNIX product, and they were eliminated from SQL Server 7.0 onward.

Each SQL Server database has three file types:

  1. Primary - This file references other files and has the extension .mdf.

  2. Secondary - Data that is not held in the primary file is held in one or more secondary files with the extension .ndf.

  3. Log - Logged information is used in database recovery, and a database will have at least one log file with the extension .ldf.

Each file has a physical operating system file name and an internal logical name that is used to refer to the file within SQL Server. Storage within the files is organized into pages and extents: a page is 8 KB in size, giving 128 pages per megabyte, and an extent is eight contiguous pages (64 KB).
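
As a hedged sketch of how these file types and names fit together (the database name, logical names, sizes, and file paths below are invented for the example):

    -- Primary, secondary, and log files, each with a logical name (NAME)
    -- and a physical operating system file name (FILENAME)
    CREATE DATABASE Sales
    ON PRIMARY
        ( NAME = Sales_Data,
          FILENAME = 'C:\MSSQL\Data\Sales_Data.mdf',
          SIZE = 100MB ),
        ( NAME = Sales_Data2,
          FILENAME = 'D:\MSSQL\Data\Sales_Data2.ndf',
          SIZE = 100MB )
    LOG ON
        ( NAME = Sales_Log,
          FILENAME = 'E:\MSSQL\Log\Sales_Log.ldf',
          SIZE = 50MB )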

6.1.2 SQL Server and Transact SQL

SQL (Structured Query Language) has been around in one form or another for many years and is the usual way in which relational databases are accessed. With Microsoft SQL Server, regardless of which front end you decide to use against the database, ultimately a stream of SQL is produced and sent to the database engine. This applies equally to Enterprise Manager, the Internet, Visual Basic, or any other client application.

Microsoft SQL Server has its own dialect of SQL called Transact-SQL (T-SQL). This is based on the language standards defined by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI); specifically, SQL Server 2000 supports Entry Level SQL-92, published (not surprisingly!) in 1992. By way of comparison, Oracle has its own dialect, PL/SQL, which also supports Entry Level SQL-92. Both products add SQL language enhancements, not part of the standard, to support specific database features.

T-SQL does more than just retrieve data from the database; it is used for the following (a short T-SQL sketch follows the list):

  • Data definition - creating the tables, relationships, and other structures for storing the data

  • Data retrieval - allowing an application or user to retrieve data from the database

  • Data manipulation - the updating or adding of data

  • Controlling access - permitting users to log on to the server and carry out appropriate tasks

  • Maintaining data integrity - creating constraints in the database, protecting it from data inconsistencies or breaches of business rules

  • Data sharing - allowing users or groups of users to share data elements
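
As a brief, hedged sketch of some of these categories (the table, column, and user names are invented for the example), the statements below define a table with an integrity constraint, add and retrieve data, and grant access:

    -- Data definition, with a CHECK constraint to maintain data integrity
    CREATE TABLE Products
        ( ProductID   int           NOT NULL PRIMARY KEY,
          ProductName nvarchar(100) NOT NULL,
          ListPrice   money         NOT NULL CHECK (ListPrice >= 0) )

    -- Data manipulation: add a row
    INSERT INTO Products (ProductID, ProductName, ListPrice)
    VALUES (1, 'Widget', 9.99)

    -- Data retrieval
    SELECT ProductID, ProductName, ListPrice
    FROM Products
    WHERE ListPrice < 20.00

    -- Controlling access: allow a database user to read the table
    GRANT SELECT ON Products TO WebUser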

6.1.3 Replication

As you probably know, replication is the process of taking data (or articles) from one server (the publisher) via a distribution server to subscribers. This allows data to be distributed according to business requirements and synchronized to ensure consistency. With merge replication, users can modify data and return it to the master server, where changes can be reconciled.

One use of this has been to implement transactional replication to copy each and every change made to a master database to an off-site subscriber. This works by using the SQL Server transaction log to detect each INSERT, UPDATE, DELETE, or other data modification and then copying these changes to the subscribing backup server. The data transfer latency can be reduced to a few seconds if the appropriate network infrastructure is in place.
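
As a rough, hedged sketch of setting up such a transactional publication with the replication system stored procedures (the publication, table, server, and database names are invented, and a configured Distributor is assumed):

    -- Enable the database for publishing
    EXEC sp_replicationdboption @dbname = 'Sales',
         @optname = 'publish', @value = 'true'

    -- Create a continuously replicated (transactional) publication
    EXEC sp_addpublication @publication = 'SalesPub',
         @repl_freq = 'continuous', @status = 'active'

    -- Publish the Orders table as an article
    EXEC sp_addarticle @publication = 'SalesPub',
         @article = 'Orders', @source_object = 'Orders'

    -- Push the publication to the off-site standby server
    EXEC sp_addsubscription @publication = 'SalesPub',
         @subscriber = 'STANDBYSRV', @destination_db = 'SalesStandby',
         @subscription_type = 'push'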

While this may be seen as an ideal solution for cheap standby servers, there are some catches. First, there is no transactional guarantee that the data has been copied over to your standby server. In other words, as the distribution server sends out the changes, there is no transactional confirmation that the changes have reached the standby server and that the data sets are in absolute synchronization. This means that if you do have a failure of your distribution server, you may have lost one or more transactions in the ether, depending on your replication latency.

Some businesses can accept this possible data loss as part of the cost/benefit analysis of this solution, but others will find this wholly unacceptable. On that basis, treat this solution with care, and maybe offer it to the business as a 'warm standby' rather than a 'hot standby' server.

6.1.4 SQL Server Performance and Availability

Performance, performance, performance. I know it's corny, but your competitor is 'just one click away,' and having a slow Web site is one sure way to push business to competitors. If we set aside all the other possible performance bottlenecks in a Web solution and focus on the database, a number of performance-enhancing factors spring to mind.

A colleague of mine recently interviewed a developer and asked him about database normalization, and it was quite evident that even a senior developer could be unaware of database design and the impact it has on performance. Normalization is the removal of redundant, or duplicate, data from a database system. While this has definite upsides for efficient data storage, it results in data being distributed across separate tables. Each time your Web application submits a query (e.g., show me all the products that I have purchased over the past year), SQL Server has to work hard to join together data from, probably, the products, customer, and sales-history tables. During this process the optimizer needs to determine whether it will be quicker to retrieve all of the product information and join it with the customer details and then the sales history, or to retrieve the sales-history data and join it with the customer table and then the product information, and so on, until all the possible join combinations have been reviewed and the best strategy selected.

The number of possible join orders is the factorial of the number of tables involved in the query, so the amount of work SQL Server has to do rises steeply as tables are added: three tables give 6 possible join orders, but six tables give 720. This can result in SQL Server becoming bogged down with a tremendous amount of work in order to deal with (superficially) simple queries.

When we design the database for Web solutions, we will undertake a formal normalization process and follow that with tactical denormalization to deal with frequently accessed data sets.
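
As a simple illustration of the idea, consider a hypothetical, fully normalized schema in which answering "what did this customer buy last year?" requires a three-table join, whereas a tactically denormalized summary table answers the same question with a single-table read. All table and column names below are invented for the example:

    -- Normalized: three tables must be joined at query time
    SELECT p.ProductName, s.SaleDate, s.Quantity
    FROM SalesHistory AS s
    JOIN Customers AS c ON c.CustomerID = s.CustomerID
    JOIN Products  AS p ON p.ProductID  = s.ProductID
    WHERE c.CustomerID = 1234
      AND s.SaleDate >= '20020101'

    -- Tactical denormalization: a pre-joined summary table for the hot query path
    SELECT ProductName, SaleDate, Quantity
    FROM CustomerSalesSummary
    WHERE CustomerID = 1234
      AND SaleDate >= '20020101'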

Database I/O can have a significant impact on performance, since the process of retrieving and submitting bytes to and from the storage subsystem takes a finite amount of time. If we assume that you have invested in decent, branded servers with plenty of memory, a RAID set, and an excellent disk controller, then we can look at the SQL Server considerations.

A useful feature in SQL Server is file groups. By grouping appropriate files (for example, data files) together into a single group, they can be managed as a single cohesive unit. This allows the DBA or developer to place the set of files onto a stripe set in a RAID configuration and reduce the amount of disk head movement to retrieve information. A typical Web site example may be placing a large product catalog into a file group alongside product-sales information. When users query product sales, the data will be retrieved from one file group.
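
A hedged sketch of that arrangement (the database, filegroup, table, and path names are invented for the example): create the filegroup, add a file to it on the dedicated stripe set, and then create the catalog table on that filegroup.

    -- Add a filegroup and place a file on a dedicated stripe set
    ALTER DATABASE Sales ADD FILEGROUP CatalogFG

    ALTER DATABASE Sales
    ADD FILE
        ( NAME = Catalog_Data,
          FILENAME = 'F:\MSSQL\Data\Catalog_Data.ndf',
          SIZE = 200MB )
    TO FILEGROUP CatalogFG

    -- Create the product catalog table on the new filegroup
    CREATE TABLE ProductCatalog
        ( ProductID   int           NOT NULL PRIMARY KEY,
          ProductName nvarchar(100) NOT NULL,
          ListPrice   money         NOT NULL )
    ON CatalogFG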

One feature that will have an extreme influence on performance is correct index usage. Many Web solutions are fairly prescriptive in their functionality. For example, a B2C shopping site will be used by people browsing and, hopefully, purchasing products. This will involve entering search criteria or paging through product listings, followed by the purchase action. You are very unlikely to allow ad hoc searching against random sets of data in this case. Part of the design work will be to identify the 15 or 20 common query actions and then create indexes to optimize them.

Queries that are selective, meaning they are likely to retrieve a small percentage of the table, lend themselves very well to nonclustered indexes. For example, when a user logs onto the site to place an order, you may wish to retrieve his or her previous orders. With a nonclustered index you would be able to retrieve the appropriate records without initiating an expensive table scan, since the first name + last name combination is reasonably selective. If, on the other hand, 90 percent of your customers happened to be named John Smith, the optimizer would choose a table scan irrespective of the index, since that takes less effort!
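
A minimal sketch of such an index (the table and column names are invented for the example):

    -- Nonclustered index to support selective lookups by customer name
    CREATE NONCLUSTERED INDEX IX_Customers_Name
    ON Customers (LastName, FirstName)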

Clustered indexes force the data to reside in index order on the disk. As previously mentioned, a reduction in I/O is generally a good thing, so reading sequential chunks of data during an index scan is often beneficial. The simplest analogy is the telephone directory, which is ordered by last name; in fact, a telephone directory is analogous to a composite clustered index, since first names are used for sorting within last names. On that basis, if your Web site user wishes to search for a range of values (i.e., show me all the orders I placed between November and January), then applying a clustered index to the appropriate column should return the results more quickly. This may at times create a conflict, since a table can have only one clustered index and you may need to choose between a clustered and a nonclustered design. I would suggest that you choose an initial index and then run the SQL Server Index Tuning Wizard after your site has been running for a while to see if it recommends an alternative based on the type of query traffic received.
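
Again as a hedged sketch with invented table and column names, a clustered index on the order date keeps orders stored in date order, which suits range queries such as "orders placed between November and January":

    -- Clustered index: the table's rows are stored in OrderDate order
    CREATE CLUSTERED INDEX IX_Orders_OrderDate
    ON Orders (OrderDate)

    -- A range query that benefits from the sequential layout
    SELECT OrderID, OrderDate, TotalValue
    FROM Orders
    WHERE OrderDate BETWEEN '20021101' AND '20030131'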

6.1.5 Availability, Disaster Recovery, and Failover Clusters

Microsoft's own IT function reckons that 99.5 percent annual availability represents an acceptable commercial system. This equates to nearly 50 minutes of downtime per week (0.5 percent of a 10,080-minute week), or roughly 44 hours a year. High availability is 99.9 percent to 99.99 percent availability, which equates to between about 10 minutes and 1 minute of downtime per week, and super high availability reduces this to even less. Many customers we deal with cannot accept any Web site downtime, since it will affect business processes such as share trading.

Whenever we are designing a .NET system to withstand disaster, we always discuss a spectrum of likely solutions, ranging from large-scale, offsite storage and transaction management to simple tape-based backups. The deciding factor in all of these is always money, and businesses need to take a balanced view on the cost/benefit equation.

Interestingly, our customers are now planning for disaster scenarios that would previously have been dismissed as far-fetched. Luckily, Microsoft SQL Server has some neat features that build in good disaster recovery quite cheaply.

With Microsoft Cluster Service, a client can still access a SQL Server application if a server in the cluster fails, since the other servers in the cluster take up the workload automatically. The usual configuration is a two-node cluster with two SQL Servers working in an active/active combination. This means that each server undertakes its normal workload but, in the event of a failure, the remaining server takes on the additional workload of the failed server. With the Windows 2000 Datacenter Server product, it is possible to have up to four nodes in a cluster.

In the early days of clusters, the physical distance between servers was limited by the length of SCSI cable between them. This essentially placed the servers side by side in the same installation. With recent innovations in clusters and networking, this distance can now be significant, allowing a cluster to be distributed over two physically separate buildings, providing better disaster recovery. For example, with a Compaq Proliant HA/F500 cluster, you can have up to 100 km between nodes with Asynchronous Transfer Mode (ATM) or Gigabit Interface Converters (GBICs).

Unfortunately, there are some drawbacks to clusters. First, in an active/active combination, if both servers are working hard and one should fail, the remaining server is left doing the work of both, which will obviously have a detrimental impact on performance. Second, when a server does fail for whatever reason, the failover process can take up to a minute or so while transactions are rolled back and the workload is transferred to the surviving server. The user, in this instance, will see a pause in service and may need to reissue queries or updates once the failover has completed.

6.1.6 .NET and SQL Server Development

.NET provides the infrastructure for developing a new range of Web-based solutions and dominates everything that Microsoft now talks about. Web services in particular dominate the conversation, since they introduce a new way of building and deploying some pretty advanced solutions using a combination of XML and the Simple Object Access Protocol (SOAP).

As far as SQL Server is concerned, life carries on much the same, because most Web-based development is fairly remote from the day-to-day running of a SQL Server. Web-based applications are essentially inserting, updating, selecting, and deleting data in the database. The major difference now is the extra demand on the DBA to build scalable and very robust databases capable of providing 24/7 service to Web site users.

Undoubtedly this will change with the advent of Yukon, the next version of Microsoft SQL Server, which will turn it into a 'proper' .NET Enterprise Server. Until then, we will have to make do with the extra XML add-ons from Microsoft.

For the application developer, Web-based development has introduced a new set of challenges and choices when it comes to building solutions to work against SQL Server.

6.1.7 Connection Technologies

Over the years an alphabet soup of connection technologies has come and gone as the technology has matured from client/server to Web-based development.

These connection technologies include the following:

  • DB-Library was the only way to build SQL Server client applications for a number of years. It is a C application programming interface (API) and allows access to all of the SQL Server features. It is still supported by Microsoft, but its longevity is in doubt because of the newer technologies in .NET.

  • Open Database Connectivity (ODBC) was Microsoft's strategic data access API. It is based upon an open standard generated by the SQL Access Group (SAG) and was widely used by developers when building Visual Basic/SQL Server client/server applications in the mid-1990s because it made SQL Server and other databases more easily accessible than was previously the case. ODBC supposedly allowed an easy swap of databases since it used a generic API, which essentially abstracted the database specifics from the client application. The reality was often very different, though, and many applications were rewritten to overcome database-specific problems.

  • Remote Data Objects (RDO) overlays ODBC and was designed to give developers a lighter interface to SQL Server and Oracle and to enable the developer to use stored procedures and other RDBMS objects for improved functionality and performance. Data Access Objects (DAO) was used by Visual Basic developers to write applications that used ODBC.

  • ActiveX Data Objects (ADO) is designed to be like both RDO and DAO. Its object model offers more flexible access to data sources, from spreadsheets to relational databases. It achieves this by providing an object interface to OLE DB, a data access interface that uses complex structures and pointers, both of which are beyond the reach of Visual Basic.

  • Initially called SQL-OLE, SQL Server Distributed Management Objects (SQL-DMO) brings the world of objects to the management of SQL Server, with an object model of 40+ separate objects containing 600+ interfaces. SQL-DMO is implemented as a dual-interface, in-process COM server and is designed for developers who wish to create applications to undertake routine tasks or manage alerts, job execution, or SQL Server replication. SQL-DMO is not used to access any user data in the database.

6.1.8 Analysis Services and SQL Server

I have been fascinated by data warehousing for many years, ever since Microsoft introduced the CUBE and ROLLUP functions in SQL Server 6.5, enabling a T-SQL programmer to aggregate values into very simple result sets that could be used to analyze data trends. I also enjoyed telling great stories (or more likely urban myths) about the sales of beer and ale being intrinsically linked on a Friday night, when men were ordered to do the shopping for the weekend and took the opportunity to stock up on alcohol. This trend was, of course, discovered by analyzing or mining the store sales data in a data warehouse and serves to prove the commercial benefit of just such a system, or so the story goes.
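
As a brief illustration (the table and column names are invented for the example), WITH CUBE adds subtotal rows for every combination of the grouping columns, so a single query returns totals by country, by product, and overall:

    -- Aggregate sales, with subtotals and a grand total added by CUBE
    SELECT Country, ProductName, SUM(SaleValue) AS TotalSales
    FROM ProductSales
    GROUP BY Country, ProductName
    WITH CUBE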

A data warehouse is a read-only repository of subject-oriented data structured to enable efficient querying and analysis. Data is taken from operational systems and fed into the data warehouse via a data cleansing or scrubbing process. Once inside the data warehouse, the data is placed into tables, linked in such a way that predefined queries execute quickly and ad hoc queries run as speedily as possible.
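
A common way of linking those tables is a star schema: a central fact table of numeric measures joined to surrounding dimension tables. A minimal, hypothetical sketch (all table and column names are invented):

    -- Dimension table describing products
    CREATE TABLE DimProduct
        ( ProductKey  int           NOT NULL PRIMARY KEY,
          ProductName nvarchar(100) NOT NULL,
          Category    nvarchar(50)  NOT NULL )

    -- Fact table holding the measures, keyed by its dimensions
    CREATE TABLE FactSales
        ( ProductKey int   NOT NULL REFERENCES DimProduct (ProductKey),
          DateKey    int   NOT NULL,
          StoreKey   int   NOT NULL,
          SaleValue  money NOT NULL,
          Quantity   int   NOT NULL )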

Performance is critical to a data warehouse, as is the accuracy of the returned data since a failure in either aspect will lead to immediate disillusionment for the users.


