Managing and Rendering Content

Internet Information Server Layer

The IIS server is at the top level of a CMS deployment. Just as Microsoft SQL Server 2000 is fundamental to the CMS server tier, Microsoft IIS is essential to the CMS front end. CMS is not a Web server; IIS is the application that serves up the pages to a Web browser.

CMS installs two ISAPI filters: the Resolution filter and the Resolution HTML Packager (REHtml Packager). The Resolution filter is the main filter. It recognizes when a request for a CMS page is made and rewrites the URL to point to the actual template file. The REHtml Packager filter has a single purpose: it aids the historical revision comparison feature by packaging HTML so that it is safe to embed in JavaScript code. The filter puts the HTML into JavaScript variables that the browser then sends to an ASP page on the CMS server. This page computes the differences between the two pieces of HTML and uses them for the compare historical revisions feature.

CMS Server Layer

The CMS server layer is one of the original CMS components. When CMS was conceived, the server layer was tasked with a tremendous amount of the processing load. Before the Publishing API (PAPI), the Authoring Connector, or even the Web Author came along, the server did almost all of the work. At that time, the Site Manager was the only interface to the server layer. Both authors and administrators used the Site Manager to do their jobs. However, as CMS has evolved, the work has been distributed over a number of components. Authoring, for example, has been removed entirely from the Site Manager.

The CMS server is one of the few components that write changes directly to the CMS database. The private server object called AEServer is used for such interactions. Some CMS features are exposed only through the AEServer API. For example, CMS user rights management is only available through the Site Manager; it is not exposed through the PAPI. At this time, there is no plan to expose AEServer to CMS customers. However, the CMS group at Microsoft is considering how it could expose the same functionality through the PAPI.

There are few components that interact directly with the CMS server. The Site Manager client is one of these interfaces. The Site Manager is used by CMS users to perform actions such as creating or assigning rights to resource galleries and other CMS containers. Developed before the introduction of the PAPI, the Site Manager communicates directly with the CMS server. Originally, the Site Manager was the only exposed interface to the CMS server. Partially for this reason, it is also one of the components that can be used remotely over HTTP. Two other components that interact directly with the CMS server are the SCA and the DCA. These components are described later in this chapter.

The CMS server is also responsible for managing the various CMS caches. The following sections will explain the purpose of these caches.

CMS Server Caches

Caching Background

Because of the architecture of CMS, caching plays a very important role. For example, .NET output caching can improve performance by 100 to 200 times. The CMS server also uses memory caches to enhance system performance. Instead of retrieving the information from the database every time the information is required, CMS temporarily stores the most frequently used data in the server memory caches.

The CMS server maintains memory caches and a disk cache for resources. Together, these caches allow the server to readily access data and files that are stored in the database. Examples of this data include CMS template metadata and placeholder content.

Here is a description of the various types of caching that CMS uses.

.NET Output Caching

.NET output caching can cache the rendered HTML of Web pages. When a cache hit occurs, the entire rendered page can be served to the client. It is possible that no CMS template code would be invoked. This offers a large performance increase because very little needs to be done to fulfill the request. Other than some rights checking, all CMS has to do is return the cached HTML.

Placeholder Definition Caching

This cache exists only in managed code. When a template is accessed, the particular version of the placeholder definitions contained within it is used as an index to the inflated placeholder definitions. Because the definitions are inflated using XmlSerialization, there is a performance hit when they are used. Multiplying that hit over every placeholder of every page is unacceptable. Fortunately, the placeholder definitions do not change very often. This makes them a prime candidate for caching.
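
The idea can be sketched in a few lines of Python. This is only an illustration, not the managed implementation inside CMS: the class name, the XML shape, and the template identifier are all hypothetical. Parsed definitions are stored under a (template, version) key, so the costly deserialization runs once per version rather than once per request.

```python
import xml.etree.ElementTree as ET


class PlaceholderDefinitionCache:
    """Caches parsed ("inflated") definitions keyed by template and version."""

    def __init__(self):
        self._inflated = {}
        self.parse_count = 0  # how many times the expensive step actually ran

    def get(self, template_id, version, definition_xml):
        key = (template_id, version)
        if key not in self._inflated:
            # Expensive step: deserialize the XML (hypothetical format).
            root = ET.fromstring(definition_xml)
            names = [p.attrib["name"] for p in root.findall("placeholder")]
            self._inflated[key] = names
            self.parse_count += 1
        return self._inflated[key]


xml_v1 = (
    '<definitions>'
    '<placeholder name="Title"/><placeholder name="Body"/>'
    '</definitions>'
)
cache = PlaceholderDefinitionCache()
first = cache.get("news-template", 1, xml_v1)
second = cache.get("news-template", 1, xml_v1)  # served from the cache
```

Because the version is part of the key, a changed template simply produces a new entry and the stale one is never consulted again.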

Fragment Caching

Fragment caching involves gathering chunks of HTML and storing the content for later use. In ASP, this is the only form of caching that the template designer can implement. For content blocks that are used in many pages, it is still useful in conjunction with .NET output caching. For example, all pages may share the same navigation code, and this code may be expensive to generate. If the cache hit rate is low on a large number of the pages, the output cache may be of little use. In this scenario, the template designer could generate the navigation code once for the entire site and cache this fragment. All subsequent page renderings would use the pregenerated navigation HTML.
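
The navigation scenario might look like the following Python sketch. The names `FragmentCache`, `render_navigation`, and the cache key are hypothetical, and the hard-coded link list stands in for an expensive walk of the site hierarchy.

```python
class FragmentCache:
    """Stores rendered HTML fragments under a caller-chosen key."""

    def __init__(self):
        self._fragments = {}

    def get_or_render(self, key, render):
        if key not in self._fragments:
            self._fragments[key] = render()  # rendered once, then reused
        return self._fragments[key]


render_calls = []


def render_navigation():
    # Stand-in for expensive navigation code.
    render_calls.append(1)
    links = ["Home", "Products", "Support"]
    return "<ul>" + "".join(f"<li>{name}</li>" for name in links) + "</ul>"


fragments = FragmentCache()


def render_page(title):
    nav = fragments.get_or_render("site-navigation", render_navigation)
    return f"<html><body>{nav}<h1>{title}</h1></body></html>"


page_a = render_page("Page A")
page_b = render_page("Page B")  # reuses the navigation fragment
```

Each page is still rendered individually, but the shared fragment is generated only on the first request.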

String Caching

Quite a few strings are repeated often in CMS. For example, every page has a list of placeholders. If there are 10,000 pages loaded into memory based on the same template with 10 placeholders, the 10 strings are replicated 10,000 times in memory. Thus, it makes a lot of sense to cache these strings and reuse them via shallow copying. To increase the effectiveness of the cache and to decrease the time spent in critical sections searching and updating it, each type of object that uses string caching has its own string cache. For example, placeholders, node properties, and layout properties have their own string caches.
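
A rough Python analogy follows. It is a sketch of the general technique (string interning with one cache per object type), and the class and variable names are hypothetical rather than taken from CMS.

```python
class StringCache:
    """Returns one shared instance for each distinct string value."""

    def __init__(self):
        self._strings = {}

    def intern(self, value):
        # setdefault stores the first instance and returns it for later equals.
        return self._strings.setdefault(value, value)

    def __len__(self):
        return len(self._strings)


# One cache per object type keeps each cache small and lookups short.
placeholder_names = StringCache()
node_properties = StringCache()


def load_page(raw_names):
    return [placeholder_names.intern(name) for name in raw_names]


# Names are built at run time, so every page starts with its own string objects.
pages = [
    load_page(["Placeholder" + str(i) for i in range(10)])
    for _ in range(1000)
]
```

After loading, all 1,000 pages refer to the same 10 string objects instead of holding 10,000 copies.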

Hierarchy Caching

There is a collection object that contains the list of child GUIDs for a given node. The collection cache keeps track of these objects, as well as a backward list that maps every node to its parent collection. This allows for quick lookups when you are moving nodes, so that you can remove a node from the old parent and add it to the new parent easily. Similar to the node cache, the collection cache is represented by an LRK Hash table that maps the GUID to a collection class. This collection class is simply a list of GUIDs of child nodes. The child nodes themselves are not present in this collection. The collection only contains references to them.
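
The two mappings can be sketched in Python as follows. Plain dictionaries stand in for the hash table described above, and the short GUID strings are invented for readability.

```python
class CollectionCache:
    def __init__(self):
        self._children = {}   # parent GUID -> list of child GUIDs
        self._parent_of = {}  # child GUID -> parent GUID (the backward list)

    def add(self, parent, child):
        self._children.setdefault(parent, []).append(child)
        self._parent_of[child] = parent

    def move(self, child, new_parent):
        # The backward list finds the old parent without scanning collections.
        old_parent = self._parent_of[child]
        self._children[old_parent].remove(child)
        self.add(new_parent, child)

    def children(self, parent):
        return list(self._children.get(parent, []))

    def parent(self, child):
        return self._parent_of[child]


collections = CollectionCache()
collections.add("guid-root", "guid-news")
collections.add("guid-root", "guid-archive")
collections.add("guid-news", "guid-story")

collections.move("guid-story", "guid-archive")
```

Only GUIDs are stored, so moving a node never touches the node objects themselves.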

Because CMS runs in a multithreaded process, cache access conflicts could compromise the integrity of the server. To prevent this, a special resource synchronization mechanism called a "WNT critical section" serializes conflicting access among different threads. Since CMS runs within the IIS process, CMS does not manage these threads itself; IIS schedules different HTTP requests to be processed by different threads.
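
As a loose analogy only, the Python sketch below uses `threading.Lock` in the role of the critical section. The `SharedCache` class and the URL key are hypothetical, and the worker threads stand in for the request threads that IIS would schedule.

```python
import threading


class SharedCache:
    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the critical section
        self._hits = {}

    def record_hit(self, key):
        with self._lock:  # only one thread at a time may update the cache
            self._hits[key] = self._hits.get(key, 0) + 1

    def hits(self, key):
        with self._lock:
            return self._hits.get(key, 0)


cache = SharedCache()


def worker():
    for _ in range(1000):
        cache.record_hit("/channels/news")


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = cache.hits("/channels/news")
```

The shorter the guarded region, the less time threads spend waiting, which is the same motivation the text gives for keeping a separate string cache per object type.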

Disk-Resident Cache

When a user uploads a file into a placeholder or adds a file to the CMS resource gallery, the file is stored within the CMS database. Examples of these resources are image files, Microsoft Word documents, and Microsoft PowerPoint presentations. These resources tend to be big (100 KB or more), and it may be slow to retrieve them from the database every time a user wants to access them.

The job of the disk-resident cache is to store the most often used resources on the server file system. Since local file access is much faster than reading an object file from a database, the retrieval of information is greatly improved.

This disk-resident cache also provides a temporary holding place for all URL-accessible files. In order for IIS to access the resources, these files must be stored on the local file system and referenced by an IIS virtual directory. These files are not deleted once the HTTP request is complete; they are stored on the file system so that they are ready to be downloaded by future users. This disk-resident cache is emptied every time the system starts. It is important to note that CMS templates are persistent files, so they are not cached by the disk-resident cache.
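
The behavior described above (empty on startup, fill on first request, keep for later requests) can be sketched in Python. Everything here is illustrative: the dictionary stands in for the database, and the directory and file names are hypothetical.

```python
import os
import shutil
import tempfile


class DiskResourceCache:
    def __init__(self, cache_dir, load_from_database):
        self._dir = cache_dir
        self._load = load_from_database
        # The cache is emptied every time the system starts.
        shutil.rmtree(self._dir, ignore_errors=True)
        os.makedirs(self._dir)

    def path_for(self, resource_id):
        path = os.path.join(self._dir, resource_id)
        if not os.path.exists(path):
            # First request: copy the resource out of the database.
            with open(path, "wb") as f:
                f.write(self._load(resource_id))
        return path  # later requests are served from the file system


database = {"logo.gif": b"GIF89a placeholder bytes"}
db_reads = []


def load_from_database(resource_id):
    db_reads.append(resource_id)
    return database[resource_id]


cache_dir = os.path.join(tempfile.mkdtemp(), "resource-cache")
cache = DiskResourceCache(cache_dir, load_from_database)
first = cache.path_for("logo.gif")
second = cache.path_for("logo.gif")  # no database read this time
```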

Node Cache

Once the CMS server is running, and users are accessing the system, node objects are instantiated in the server. These objects include concepts such as the CMS channels, postings, template gallery items, and resource gallery items. These objects are then filled with information from the relational database. Some of these node objects, known as the shared objects, stay in IIS server process memory space. These shared objects retain their information so that they can possibly be reused by the next user request. Because the database does not have to be accessed every time, information retrieval is accelerated.

In order to maintain and locate these objects and their relationship with the database, a special static control object is used. This object is called the node cache. The node cache stays in memory and holds references to the most often used nodes. This increases performance because it keeps more objects in memory and minimizes the calls to the SQL server. The size of the node cache is set in the SCA (Figure 3-3).

Figure 3-3. The SCA cache configuration settings


The node cache periodically checks for changes in the database. This check happens every six seconds and on every edit request. Only nodes that have been modified are removed from the cache.
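
A minimal Python sketch of such a cache is shown below. Two assumptions are worth stating: the least-recently-used eviction policy is a guess (the text says only that the cache holds the most often used nodes up to a configured size), and the six-second database check is represented by a direct call to `invalidate`.

```python
from collections import OrderedDict


class NodeCache:
    def __init__(self, max_nodes, load_node):
        self._max = max_nodes          # corresponds to the size set in the SCA
        self._load = load_node
        self._nodes = OrderedDict()    # oldest entry first

    def get(self, guid):
        if guid in self._nodes:
            self._nodes.move_to_end(guid)  # mark as recently used
            return self._nodes[guid]
        node = self._load(guid)            # cache miss: read the database
        self._nodes[guid] = node
        if len(self._nodes) > self._max:
            self._nodes.popitem(last=False)  # evict the least recently used
        return node

    def invalidate(self, modified_guids):
        # Only nodes that changed in the database are removed.
        for guid in modified_guids:
            self._nodes.pop(guid, None)

    def __contains__(self, guid):
        return guid in self._nodes


database = {"A": {"name": "Home"}, "B": {"name": "News"}, "C": {"name": "Jobs"}}
loads = []


def load_node(guid):
    loads.append(guid)
    return dict(database[guid])


cache = NodeCache(2, load_node)
cache.get("A")
cache.get("B")
cache.get("A")
cache.get("C")                 # cache is full, so "B" is evicted

database["A"]["name"] = "Start"
cache.invalidate(["A"])        # what the periodic check would do
refreshed = cache.get("A")     # reloaded from the database
```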

Node Collection Cache

A node collection object is used to store the relationship between child nodes and their parent node. It contains the parent node GUID and the list of GUIDs associated with the child nodes. As with the node cache, after the node collection object is instantiated, it can be initialized from information stored in the database. These node collection objects remain in the server process memory as long as there is sufficient space. This allows them to be accessed quickly. In order to maintain and locate these node collection objects, a special static control object is used. This is called the shared node collection cache.

User Node Cache

Currently, CMS users must be either NT users or Active Directory users. In order to find information about a particular user quickly, a user node cache is created. The user node cache consists of a map between the user name and the user's database ID in the AEUser table. The map enables a user name to be quickly translated to the proper database ID.

Before a CMS asset is read or accessed, the user's rights are verified by the CMS server. A page node contains a reference to a list of rights groups, and each rights group contains its members. Using the map, the server can quickly determine the database ID of the user and determine what rights the user has to the appropriate page node.
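
The lookup chain (user name to database ID, then page node to rights groups to members) might be modeled like this in Python. All user names, IDs, group names, and GUIDs are made up for the example.

```python
# Hypothetical data: user name -> database ID (the user node cache).
user_ids = {"DOMAIN\\alice": 101, "DOMAIN\\bob": 102}

# Rights group name -> set of member database IDs.
rights_groups = {"Authors": {101}, "Subscribers": {101, 102}}

# Page node GUID -> rights groups referenced by that node.
page_rights = {"page-1": ["Authors"], "page-2": ["Subscribers"]}


def can_access(user_name, page_guid):
    # One dictionary lookup replaces a database query for the user's ID.
    user_id = user_ids.get(user_name)
    if user_id is None:
        return False
    groups = page_rights.get(page_guid, [])
    return any(user_id in rights_groups[group] for group in groups)


alice_can_edit = can_access("DOMAIN\\alice", "page-1")
bob_can_edit = can_access("DOMAIN\\bob", "page-1")
bob_can_read = can_access("DOMAIN\\bob", "page-2")
```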

High-Level Caches

A series of caches inside CMS is used to speed up certain operations. These high-level caches are contained entirely within the CMS system and are not accessible via any API. Their purpose is to store frequently used items that are computationally expensive to produce. One example is the URL transformation that is done for the ISAPI filter. The same URL is always transformed into the same new URL, so it does not need to be calculated every time. The calculation would normally involve traversing the channel hierarchy, grabbing and inflating all the items, comparing their names, drilling down to the next level, and grabbing the appropriate template. Since the information does not normally change, we can simply store an entry in a cache that maps the friendly URL to the file system URL.

Unlike the node cache, these caches are not able to just toss out one item when a node changes in the database. The reason for this is that if you happen to rename your root CMS channel, every URL in the entire system will be different. This is the one problem with the high-level caches; they are based on an aggregation of data. On the plus side, changes to the system are generally few compared with the number of page hits.
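
The trade-off can be seen in a small Python sketch. The nested dictionary is a hypothetical stand-in for the channel hierarchy, and the template path is invented. Note that the cache has no way to drop a single entry safely, so any hierarchy change clears everything.

```python
class UrlCache:
    def __init__(self, resolve):
        self._resolve = resolve
        self._map = {}  # friendly URL -> file system URL

    def lookup(self, friendly_url):
        if friendly_url not in self._map:
            self._map[friendly_url] = self._resolve(friendly_url)
        return self._map[friendly_url]

    def on_hierarchy_changed(self):
        # Entries aggregate data from many nodes, so all of them are discarded.
        self._map.clear()


# Hypothetical channel hierarchy; leaves are template file URLs.
hierarchy = {"channels": {"news": {"today": "/templates/article.aspx"}}}
resolve_calls = []


def resolve(friendly_url):
    # Stand-in for the expensive walk down the channel hierarchy.
    resolve_calls.append(friendly_url)
    node = hierarchy
    for segment in friendly_url.strip("/").split("/"):
        node = node[segment]
    return node


urls = UrlCache(resolve)
first = urls.lookup("/channels/news/today")
again = urls.lookup("/channels/news/today")   # answered from the cache

# Renaming the root channel changes every URL in the system.
hierarchy["site"] = hierarchy.pop("channels")
urls.on_hierarchy_changed()
renamed = urls.lookup("/site/news/today")
```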

There are also high-level caches for the rights of the guest user, placeholder content inflation, resource lookup, URL generation, and the fragment cache that can be used with ASP templates.

The Sandbox

The sandbox is where transient uncommitted data is held and operated on. It is quite similar to the master cache in that it contains a collection cache, a series of interfaces to allow for manipulation of nodes, and a node cache. This is a smaller node cache retrieved from the master cache.

The node cache in a sandbox is owned entirely by the sandbox for the duration of the request. It cannot be accessed by any other session, which allows it to be free of thread synchronization. Whenever a node in the sandbox cache is modified, it is cloned, the original is removed from the sandbox, and the clone is kept to be operated on. This ensures that the shared object in the master cache is never changed mid-request. The collection cache in the sandbox works the same way: it starts with the same objects as the master cache, but each collection is cloned before being operated on.

When a transaction has been committed, the sandbox passes itself to the master cache to examine and synchronize with the master cache's internal state. At this point, modified objects are moved over, collections are updated, and derived behavior (high-level cache and so on) is enacted.
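
The clone-then-commit pattern can be sketched in Python as follows. The class and method names are hypothetical, nodes are plain dictionaries, and the synchronization step is reduced to copying modified nodes into the master cache.

```python
import copy


class MasterCache:
    def __init__(self, nodes):
        self.nodes = nodes  # GUID -> node, shared by all requests

    def synchronize(self, sandbox):
        # Modified objects are moved over when the transaction commits.
        self.nodes.update(sandbox.modified)


class Sandbox:
    def __init__(self, master):
        self._master = master
        self.modified = {}  # private to one request, so no locking is needed

    def read(self, guid):
        if guid in self.modified:
            return self.modified[guid]
        return self._master.nodes[guid]

    def write(self, guid, **changes):
        if guid not in self.modified:
            # Clone on first modification; the shared original stays untouched.
            self.modified[guid] = copy.deepcopy(self._master.nodes[guid])
        self.modified[guid].update(changes)

    def commit(self):
        self._master.synchronize(self)


master = MasterCache({"guid-1": {"name": "Draft title"}})
sandbox = Sandbox(master)
sandbox.write("guid-1", name="Final title")

seen_by_others_before = master.nodes["guid-1"]["name"]
sandbox.commit()
seen_by_others_after = master.nodes["guid-1"]["name"]
```

Other requests keep seeing the original node until the commit, which is what makes uncommitted data transient.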

CMS Database

At the base level there is a Microsoft SQL Server 2000 database. The architecture of CMS relies heavily upon Microsoft SQL Server 2000. Without the database layer, it would not be possible to provide the key features of CMS.

The CMS server executes complex procedures against the database. These actions cannot be easily reproduced by manually manipulating the data store. For this reason, the database schema is not published, and customers are discouraged from direct interaction with the CMS database. CMS is designed such that almost all the interaction with the CMS database is handled by the CMS server layer. By using the CMS API, customers can be assured that their interface to the database has been fully tested.

The database schema consists of 40 tables. CMS data is stored in various formats within these tables. Opening the AEUser table, for example, plainly shows the users and groups that have been added to the CMS rights model (Figure 3-4).

Figure 3-4. AEUser table shown in SQL Server Enterprise Manager


However, finding information stored in a particular placeholder is more difficult. The reason for this is that CMS was designed to be used through the PAPI, not via direct access to the database. Placeholder data is often stored in binary large object (BLOB) format and cannot be easily found or read. Resources within placeholders are stored in the "blob" table, but placeholder data is stored in the NodePlaceholderContent table. It is difficult to find placeholder data, because CMS assembles various pieces of content from this table.

Although it is possible to create a cluster of CMS servers, CMS only supports an active/passive SQL Server cluster. The reason for this is that the CMS 2002 database is restricted to one Microsoft SQL Server 2000 machine. In contrast to Commerce Server 2002, it is not possible to split CMS tables between different Microsoft SQL Server 2000 boxes. Also, the CMS server will recognize if tables have been added or deleted. Altering the database schema in either of these ways will render the CMS database invalid for new installs or upgrades.

When the CMS server is in read-only mode, roughly 90% of interaction with the database is through Structured Query Language (SQL) stored procedures. An example of this is a read-only CMS server caching CMS pages. When the server is writing (for example, creating new CMS pages), approximately half of the database queries are ad hoc server queries and the other half are stored procedures.

As mentioned earlier, the schema for the CMS data store is not published, and direct requests to the database are not supported by Microsoft. However, many people are curious about the architecture of the data store. Here is some information that will allow you to perform read-only tasks on the CMS data store. One of the more interesting tables in the CMS database is the Node table. This table contains the core information about the CMS system.

Table 3-1 shows some examples of node types that are stored within the CMS Node table.

Using this information, you can do simple read-only queries against the CMS database. Under no circumstances should write operations be performed directly against the database. The PAPI is the appropriate interface for writing to the CMS database. Note that since the schema has not been released, these queries may not work on future releases of CMS.

Table 3-1. CMS Node Types
Node Type	Type of Object
1	Server
4	Channel
16	Page or posting
64	Resource gallery
256	Resource gallery item
16384	Template gallery
65536	Template

These are examples of read-only SQL queries that can be run against the CMS database:

Find the page/posting by name:
Select * From Node Where type=16 AND name='Page Name'
Find the page and posting GUID from the page name:
Select posting.nodeguid posting, page.nodeguid page From Node page inner join Node posting on page.nodeguid = posting.followguid and page.name='Page Name' AND page.type=16 and page.isshortcut=0
Find the posting from the page followguid:
Select * From Node Where followguid = '{87B29228-3CFC-426A-8DD0-7B5E33CXXXXX}'
Find information about the posting (add your own GUID):
Select * From Node Where nodeguid = '{0369528B-943B-40A0-8B57-C5E3578XXXXX}'
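
To experiment with the shape of these queries without touching a CMS database, a toy table can be built in memory. The Python sketch below uses SQLite as a stand-in for SQL Server and includes only the columns that appear in the queries above; the real Node table is unpublished and certainly contains more. The GUID values are invented, and the idea that a posting row has isshortcut set to 1 is an assumption inferred from the join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Node ("
    "nodeguid TEXT, followguid TEXT, name TEXT, type INTEGER, isshortcut INTEGER)"
)
conn.executemany(
    "INSERT INTO Node VALUES (?, ?, ?, ?, ?)",
    [
        ("{PAGE-0001}", None, "Page Name", 16, 0),           # the page
        ("{POST-0001}", "{PAGE-0001}", "Page Name", 16, 1),  # posting (assumed)
        ("{CHAN-0001}", None, "News", 4, 0),                 # a channel
    ],
)

# Find the page/posting rows by name (type 16 is "Page or posting").
by_name = conn.execute(
    "SELECT nodeguid FROM Node WHERE type = 16 AND name = ?", ("Page Name",)
).fetchall()

# Find the posting GUID and page GUID from the page name.
pairs = conn.execute(
    "SELECT posting.nodeguid AS posting, page.nodeguid AS page "
    "FROM Node page INNER JOIN Node posting "
    "ON page.nodeguid = posting.followguid "
    "AND page.name = ? AND page.type = 16 AND page.isshortcut = 0",
    ("Page Name",),
).fetchall()

# Find the posting from the page followguid.
postings = conn.execute(
    "SELECT nodeguid FROM Node WHERE followguid = ?", ("{PAGE-0001}",)
).fetchall()

conn.close()
```

Passing the name and GUID as parameters rather than pasting them into the SQL text avoids quoting mistakes when a page name contains an apostrophe.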