How the PivotTable Component Works with Data | Programming Microsoft Office 2000 Web Components (Microsoft Progamming Series)

[Previous] [Next]

One of the most important and complicated aspects of the PivotTable component is how it interacts with various data sources and how it manages that data during a session. This section will explain how the PivotTable control communicates with data sources, as well as how data is transferred and manipulated during a session.

The PivotTable control is a bit schizophrenic—much of its capabilities depend on the kind of data source to which it is connected. Essentially, the PivotTable control can use only two kinds of data sources: tabular and multidimensional. (Multidimensional data sources can also be called OLAP data sources; I will use these two terms interchangeably in this book.) We'll also discuss using XML data as a data source. Although XML data looks like any other tabular data source to the PivotTable control, it has a few requirements that warrant special discussion.

Tabular Data Sources

Tabular databases include any existing OLE DB data sources that expose tables of data. Traditionally, these are the relational database engines of the world. However, this category can also include nonrelational data providers—as long as they have some form of textual command syntax or named tables.

Figure 4-6 shows how a report initially looks when the PivotTable component is loaded with data returned from a tabular data source. (You can also see this report by running the PivotTableList.htm file from the Chap04 folder on the companion CD.)

The report is similar to that produced by using an external data range in an Excel spreadsheet. However, since the PivotTable control combines the functionality of external data ranges and PivotTable reports, you can now group the data by any field and create a new total for any field. For example, using the Move To Row Area, Move To Column Area, and AutoCalc toolbar buttons, you can transform this flat list of data into the PivotTable report shown in Figure 4-7.

click to view at full size.

Figure 4-6. A PivotTable report filled with data from a tabular data source.

click to view at full size.

Figure 4-7. A PivotTable report created from a flat list of data.

Because the data source is tabular, the PivotTable control can show the details behind any total—meaning you can expand any number and see the rows that contributed to it right in place. Figure 4-8 shows the general architecture for accessing tabular data sources.

click to view at full size.

Figure 4-8. Accessing tabular data sources.

When retrieving data, the PivotTable control first connects to the OLE DB provider named in the provider attribute of the connection string. When building a report in a designer, you typically will choose this provider from a list in the Data Link Properties dialog box. The provider is an in-process COM component that resides on the client machine and typically communicates to the data server (if it is an actual server machine) through a private protocol. For example, the provider for SQL Server communicates to the server using a variety of protocols, the most common being named pipes. However, the provider for Microsoft Jet databases requires file access to the MDB file because the Jet database engine is not a client-server system.

After the PivotTable control connects to the data source, it passes the contents of its CommandText property to the provider for execution. You can set the CommandText property at design time using the Data Source section of the Property Toolbox, or you can set it in code at runtime. The PivotTable control uses the ADO Recordset object to execute the command text, so any value that can be passed to the Recordset's Open method can be used in the CommandText property. These values typically include SQL statements or the name of a table, view, or stored procedure. The provider then returns an OLE DB IRowset interface that allows access to data returned from the command.

When working with a tabular data source, the PivotTable control uses ADO to immediately load the returned data into a component known as the Windows Cursor Engine (WCE). The WCE is a component provided in the Microsoft Data Access Components (MDAC), offering advanced scrolling, sorting, and filtering functionality over any data provider. The WCE loads the data from the source provider into its own memory cache, which will eventually page to disk if it contains more data than its memory threshold will allow. (This keeps the WCE from using all your available system memory.) After the data is loaded into the WCE, the PivotTable control communicates with it to filter, sort, and scroll around the data set.

The magic begins when you start to group the resultset by a field or when you create a total using steps similar to those described earlier. To perform the cross tabulations, the PivotTable control employs another piece of data plumbing known as the PivotTable Services component. This component is actually the client-side provider for OLAP Services, but it also can provide temporary cube creation on the client regardless of the data source. When you group fields or create a total, the PivotTable control hands the PivotTable Services component a reference to the data set and describes what dimensions and totals it needs in the temporary cube. This engine will create a temporary file in the folder that serves as the temporary folder for Microsoft Windows, so beware of this requirement if your company's policy is to not allow controls in a web browser to create temporary files.

Naming the Temporary Cube
When implementing this feature, one of the star data developers of the PivotTable component, David Wortendyke, had to devise a scheme for naming the temporary cube so that it would not overwrite any existing file or interfere with PivotTable controls running in other applications. His eventual scheme was to construct the name of the file using the current process and thread ID and the traditional CUB extension.

So when grouping and creating totals in a PivotTable control using a tabular source, if you see some crazily named file in your temporary folder, remember that it is a temporary cube file being created by the PivotTable control. Don't worry—these files will be deleted automatically when the control is destroyed.

When working with tabular data, the PivotTable control also will automatically generate two time hierarchies for each date or date/time field in the detail data. One hierarchy contains the grouping intervals Year, Quarter, Month, and Day; the other contains the intervals Year, Week, and Day. (Both hierarchies are necessary because weeks do not neatly roll up into months.) These automatic hierarchies make it easier to analyze data that has a time dimension, allowing you to see summary values for each of the intervals.

If you plan to use the PivotTable control on a web page, you might also want to investigate using Remote Data Services (RDS). This is another piece of data access plumbing provided in MDAC, which accesses data sources over HTTP. When using RDS, the only provider needed on the client machine is the RDS provider, which is installed with the Office Web Components. The RDS provider then communicates with the real data provider—for example, SQL Server—through a web server, allowing the native data source provider to exist only on the server. For more information on RDS, consult the data access portion of the Microsoft web site at http://www.microsoft.com/data.

Multidimensional (OLAP) Data Sources

While you are most likely familiar with tabular or relational data sources, you might not be as familiar with multidimensional (or OLAP) data sources. Before I describe how the different elements of the PivotTable component map to the structures of a multidimensional database, let me give you a brief introduction to multidimensional database concepts.

A Brief Overview of OLAP

In a relational database, tables and relationships are the primary data structures and concepts. You construct databases by defining tables that contain one or more columns, a primary key, rules, and so on. Then you relate those tables to each other by specifying the foreign keys that match primary keys in other tables. Once this is done, you can execute a SQL statement against the database engine and it will join, sort, restrict, and group data as needed to fulfill the request.

In a multidimensional database, the primary data structure is a cube, or more precisely, a hypercube. This structure is an N-dimensional matrix, which is a bit hard to visualize. The items contained in each dimension are called members, and the intersection of N members produces a number. Looking at an example will help make this much less abstract.

Imagine that we are modeling sales data for a company in this hypercube. In our example, we will start with two dimensions: Products and Customers. A two-dimensional structure is fairly easy to visualize because it looks like a matrix you might see for comparing two dimensions of information, such as a crosstab report. Figure 4-9 shows how this matrix might look.

click to view at full size.

Figure 4-9. A two-dimensional database.

Note that customer names appear in one dimension, product names appear in the other, and the numbers in the center are sales. For any combination of product and customer, a value is stored representing the amount of money the customer spent on the product. Also note that one extra member, named All, appears in each dimension. This member represents the total for all members in that dimension (often a sum of all the members). So the intersection of Customers.All and a specific product represents the total sales for that product. Similarly, the intersection of Products.All and a specific customer represents the total sales made to that customer. The intersection of the two All members is the grand sales total of all products to all customers.

Now imagine adding to this matrix a third dimension that contains salesperson names. The structure becomes a three-dimensional cube and conceptually looks like Figure 4-10.

Three coordinates—a customer, a product, and a salesperson—now determine each intersection or cell in the cube. Again a member named All appears in the Salesperson dimension and symbolizes the total sales for all salespeople. This structure can help you answer a variety of questions by allowing you to view the data from a number of perspectives. Since these numbers are stored in the structure, the multidimensional database can return any set of these cells quickly.

click to view at full size.

Figure 4-10. A three-dimensional database.

It is harder to visualize four dimensions, but suppose you want to summarize additional data values. For example, you might want to track quantity sold as well as the dollar value of items sold. These multiple values create a fourth dimension that has two members (Quantity Sold and Dollar Value of Items Sold). These data values are called measures in the multidimensional database; however, most data sources treat measures like any other dimension. Figure 4-11 shows one way of visualizing four dimensions of data.

Internally, the multidimensional database will still store all the data in one, fourdimensional hypercube. But you can conceptualize a four-dimensional data structure by thinking of the fourth dimension as multiple three-dimensional cubes. If you want to see the dollar value of items sold for a given customer, product, and salesperson intersection, you would look at the first cube. If you want to know the quantity sold for the same intersection, you would look at the second cube. You could of course expand this example to show tables of cubes and cubes of cubes—but I will stop before your mind explodes from visualizing 16-dimensional space.

click to view at full size.

Figure 4-11. A four-dimensional database.

Most multidimensional databases also let you group the members contained in a dimension, potentially specifying a parent and set of children for each member. In fact, dimensions have one or more hierarchies defined within them, and each hierarchy has one or more levels, each of which has a set of members. This mimics the natural structure of most categorical data—products typically fall into groups of related products, customers live in cities within states within countries, and salespeople belong to certain districts that belong to certain regions and so on. For example, the Customers dimension might have the levels All, Country, State, City, and Customer Name. The set of members at the Country level might be USA, Canada, and Mexico, while the set of members for the State level might be Washington, Oregon, British Columbia, Alberta, Jalisco, Veracruz, and so on.

It is possible to have multiple hierarchies within a single dimension. For example, if you have an Employees dimension, you might want to calculate travel expenses along organizational lines to see totals for each manager and department head, or you might want to see the totals for all employees that perform a certain job function, such as marketing, sales, product development, or executive staff. The members of the dimension remain the same (the individual employees), but they are organized into different hierarchies and therefore create different totals.

A number of books, journals, reports, and pieces of documentation describe multidimensional databases in depth. If you have purchased a multidimensional database, chances are the documentation that came with your database explains these concepts in much more detail than I have time to do here.

How the PivotTable Component Interacts with OLAP Data Sources

The PivotTable component communicates and interacts with OLAP data sources using a similar approach to tabular data sources. Figure 4-12 provides a general picture of the architecture.

click to view at full size.

Figure 4-12. The interaction between the PivotTable component and an OLAP data source.

The PivotTable control uses the OLE DB for OLAP standard defined by Microsoft and supported by many multidimensional databases. This model is an extension of the OLE DB standard, so much of the PivotTable control's interaction with the OLAP data provider naturally is similar to how it interacts with tabular data providers. The control begins by connecting to the data provider, which again is an in-process COM component that resides on the client machine. The provider determines how it will communicate with the multidimensional database. For example, OLAP Servicesuses a TCP/IP socket connection between the client and the server.

After the PivotTable component is connected to the OLAP data source, it can display all the hierarchies and measures in the specified hypercube through the PivotTable Field List window. As the user drags and drops hierarchies and measures to the PivotTable control, or as the developer inserts hierarchies and measures programmatically, the PivotTable control generates the necessary queries in MDX (Multidimensional Expressions, which is the query language defined by the OLE DB for OLAP specification) and executes them against the data source. The data provider returns the results, and the PivotTable control displays them onscreen.

When working with an OLAP data source, the amount of data transmitted across the network is quite small. The OLAP provider typically sends only the MDX query string to the server, and the server returns only the cells and member names that you see onscreen. The server sends only the aggregate values back to the client, rather than all the underlying detail data that was necessary to create those aggregate values. This makes the PivotTable control very responsive and allows the system to scale to support a large number of simultaneous clients.

Should I Make a Cube or Just Group Tabular Data?
When I show people that the PivotTable component can group and total tabular data so that it looks like a report from an OLAP cube, they often ask, "So why should I make a cube?"

The answer is twofold. First, using a precreated, server-based hypercube often yields drastic improvements in performance over using the PivotTable control to create temporary cubes of tabular data. Every time you group a new field in the tabular data, the PivotTable control must re-create the cube and reprocess all the aggregates. A server-based cube processes the aggregates only once and shares them with all the clients accessing the cube.

Second, a precreated cube can define hierarchies with multiple levels that establish a clear drill-down path through the data. When the PivotTable control groups relational data, it creates hierarchies for date fields only; it cannot know that fields such as Country, State, and City are actually three levels of the same hierarchy. In a precreated cube, you can define these hierarchies and make it easier for users of the data to find the information they seek.

XML

One special data source for the PivotTable component is a URL that returns XML data in a specific format. In the ADO 2.1 release, the Data Access Group at Microsoft (the group that makes MDAC) defined an XML format for persisting an OLE DB Rowset. They also built a piece of data access plumbing called the persistence provider, which can save and load an OLE DB Rowset by writing and reading XML data in this format. The PivotTable control is capable of using this provider to load the Windows Cursor Engine with XML data returned from a specified URL. Figure 4-13 depicts the architecture for this scenario.

click to view at full size.

Figure 4-13. Using the persistence provider to load the WCE with XML data.

In a moment, I will explain what type of connection string you must pass to the PivotTable control to use this approach. However, for purposes of the discussion at hand, the important piece of information the PivotTable control needs is the URL from which it should retrieve the XML data stream. The PivotTable control hands this URL to the persistence provider, which in turn uses the Internet services of Windows to request the results of that URL. The results are parsed and loaded into the WCE, and the PivotTable control continues on—just as it would when working with tabular data.

The format for this XML data is specific and unfortunately is not well documented. However, the easiest way to see what it looks like is to use the ADO Recordset object's Save method with the adPersistXML format to save the contents of a Recordset to a file. If you want to generate XML data in this format dynamically—for example, in a Microsoft Active Server Pages page—use the Recordset's Open method to test your output. If you can load your XML data into an ADO Recordset object, it will load into the PivotTable control because the control uses the same mechanism. For an example of generating XML data from an ASP page, see the source code for the solution discussed in Chapter 6.