Populating a Document with Data on the Server

Consider the following portion of an all-too-common server scenario. An authenticated user, perhaps a salesperson, requests an Excel spreadsheet from a server. The spreadsheet is an expense report, and the server is an ASP, ASP.NET, or SharePoint server. The server code looks up some information about the user from a database, Active Directory, or Web service. For example, perhaps the server has a list of recent corporate credit card activity that it will prepopulate into the expense list. The server starts up Excel but keeps it "invisible" because there is no interactive user on the server. It then uses the Excel object model to insert the data into the appropriate cells, saves the result, and serves up the resulting file to the user.

This is a very suboptimal document life cycle for two reasons. First, it is completely unsupported and strongly recommended against by Microsoft. Word and Excel were designed to be run interactively on client machines with perhaps a few instances of each running at the same time. They were not designed to be scalable and robust in the face of thousands of Web server hits creating many instances on "headless" servers that allow no graphical user interfaces.

Second, this process thoroughly conflates the "view" with the data. The server needs to know exactly how the document is laid out visually so that it can insert and remove the right fields in the right places. A simple change in the document format can necessitate many tricky changes in the server code.

But automatically serving up documents full of a user's data is such a compelling scenario that many organizations have ignored Microsoft's guidelines and build solutions around server-side manipulation of Word and Excel documents. Those solutions tend to have serious scalability and robustness problems.

What can we do to mitigate these two problems?

Data-Bound VSTO Documents

As discussed in Chapter 17, "VSTO Data Programming," one way to solve this problem is to move the processing onto the client. Just serve up a blank document that detects whether there is no cached data in its data island and fills its datasets from the database server if so. When the client is ready to send the data back to the database, it connects again and updates the database. No special document customization has to happen on the server at all, and the database server is doing exactly what it was designed to do.

This solution has a major drawback, however: It requires that every user have access to the database. From a security perspective, it might be smarter to only give the document server access to the database, thereby decreasing the "attack surface" exposed to malicious hackers. What we really want to do is have the document ready to go with the user data in it from the moment they obtain the document, but without having to start up Word or Excel on the server.

XML File Formats

Avoiding the necessity of starting up a client application on the server is key. Consider the first half of the scenario above: The server takes an existing on-disk document and uses Excel to produce a modified version of the document. Excel is just a means to an end; if you know what changes need to be made to the bits of the document and how to manipulate the file format, you have no need to start up the client application.

The Word and Excel binary file formats are "opaque," but Word and Excel now support persisting documents in a much more transparent XML format. It is not too hard to write a program that manipulates the XML document without ever starting up Word or Excel.

However, the XML file formats have some drawbacks. Although it is certainly faster and easier to manipulate the XML format directly, parsing large XML files is still not blazingly fast. XML files tend to be quite a bit larger than the corresponding binary files. And worst, although the Word XML format is "full fidelity," the Excel format is not. Excel loses information about the VSTO customization when it saves a document as XML.

Furthermore, unfortunately the Word XML file format does not store the data island in human-readable, editable XML. Rather, it serializes out the binary state that would have gone into the binary file format data island.

Also, we have not addressed the second problem that we identified earlier. Now we are not just manipulating the view, we are manipulating the persisted state of the view to insert or extract data. It would be much cleaner if we could simply get at the data island.

We need a way to solve these additional problems; we need a solution that works on binary non-human-readable files, works with VSTO-customized documents, and cleanly separates view from data.

Accessing the Data Island

Chapter 17 showed how to cache the state of public host item class members that contain data in a "data island" so that they could be persisted into the document as XML, independent of their user-interface representation. The VSTO 2005 runtime library comes with a class, ServerDocument, which can read and write the data island directly; it does not need to start up Word or Excel on the server. The ServerDocument object can read and write Word documents in binary or XML format and Excel documents in binary format.

Let's re-create the above document life cycle using the data island. Then we describe the advanced features of the ServerDocument object model in more detail.

Part One. An Introduction to VSTO