In the vast majority of cases, you don't want your customer directly accessing or changing your schema or the data it contains. It is simply too risky. Direct access bypasses the validation rules that usually lie within the services or domain layer, causing corrupt or inaccurate data. Transaction-processing rules that require the simultaneous update or coordination of multiple systems are only likely to be executed if they are invoked through the proper services. Upgrade scripts that work just fine in QA can break at a customer's site if the schema has been modified.
All of that said, there are times when a customer wants access to persistent data, and there are times when you want to give it to them. They may want to write their own reports or extract data for use in other systems, and it may be easier and faster to let them do this on their own without your intervention. Or they may need to extend the system with additional data that meets their needs now rather than wait for a release that may be months away. Of course, your marketect may also push to allow customers direct access to persistent data, as this makes switching costs exorbitantly high.
Since chances are good that you're going to have to provide some access to your persistent data, the following sections describe a few techniques that should help.
Providing a layer of indirection between components is one of the broadest and most time-tested principles of good design. Well-placed layers of indirection reduce coupling and enhance flexibility.
Views are logical databases constructed on top of physical schemas. They provide a layer of indirection between how the data is used in a schema and how it is defined. The value of a view is that it gives you some flexibility in changing the underlying schema without breaking applications or components that rely on a specific schema implementation. Views are useful in a variety of situations, such as when you want to give your customer a schema optimized for reporting purposes. The first, and often the most important, way to provide access to persistent data is through a view.
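The decoupling a view provides can be sketched concretely. The following is a minimal illustration using SQLite from Python; the table, column, and view names are all hypothetical. The view exposes a stable, reporting-friendly vocabulary, so the physical table can later be renamed or restructured as long as the view is redefined to match.

```python
import sqlite3

# Hypothetical internal schema with implementation-specific names.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_order (
        order_id  INTEGER PRIMARY KEY,
        cust_ref  TEXT,
        amt_cents INTEGER,
        status_cd INTEGER
    )
""")
conn.execute("INSERT INTO tbl_order VALUES (1, 'ACME', 125000, 2)")

# The view is the layer of indirection: customers report against it,
# never against tbl_order directly.
conn.execute("""
    CREATE VIEW customer_orders AS
    SELECT order_id          AS order_number,
           cust_ref          AS customer,
           amt_cents / 100.0 AS amount_dollars
    FROM tbl_order
    WHERE status_cd = 2   -- expose only completed orders
""")

for row in conn.execute("SELECT customer, amount_dollars FROM customer_orders"):
    print(row)
```

If the underlying table is later split or renamed, only the view definition changes; reports written against `customer_orders` keep working.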
Many times you know that a user will want to add some of their own data to a schema, but you don't want to provide a lot of tools or infrastructure to support this, because creating them is likely to cost considerable time and money. A simple approach to providing for extensible data is to define extra fields in key tables that can be customized by a user. Simply throw in a few extra ints, dates, and strings, and provide a way to edit them in your user interface. The result often produces user interfaces with labels like "User Date 1" or "Customer String". This simple approach can be surprisingly effective!
Unfortunately, it has plenty of drawbacks. Database purists feel nauseated when they find systems based on user fields because the database has not been properly modeled. You can't run meaningful reports on these data because there are no semantically meaningful fields for query manipulation: a query like select column_42 from TBL_USERDATA where column_41 > 100 is not very understandable. Different users may interpret the data in different ways, further compounding the errors ("I thought I was supposed to put the date of the last purchase in the second field, not the fourth field"). This can be mitigated by providing system administrators with tools to edit the field labels (instead of "User Date 1" the field might say "Date of Last Purchase"), but it does not solve the problem. The lack of data modeling means that the data is highly suspect, as it is stored without the application of any business rules, even basic edit checks.
You'll have to judge whether the relatively trivial ease with which user columns can be added to the database makes them appropriate for your application.
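A minimal sketch of the user-fields approach, again using SQLite from Python; the column names, labels, and sample values are hypothetical. The generic columns carry no semantics of their own, which is precisely both the appeal and the drawback described above; the relabeling dictionary stands in for the administrator's label-editing tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Core columns plus a handful of untyped, unmodeled "user" columns.
conn.execute("""
    CREATE TABLE customer (
        id            INTEGER PRIMARY KEY,
        name          TEXT,
        user_date_1   TEXT,
        user_string_1 TEXT,
        user_int_1    INTEGER
    )
""")

# Administrators relabel the generic columns so the UI shows
# "Date of Last Purchase" instead of "User Date 1".
field_labels = {
    "user_date_1":   "Date of Last Purchase",
    "user_string_1": "Region Code",
    "user_int_1":    "Loyalty Points",
}

conn.execute(
    "INSERT INTO customer VALUES (1, 'ACME', '2003-06-15', 'NW', 450)"
)
row = conn.execute(
    "SELECT user_date_1, user_int_1 FROM customer WHERE id = 1"
).fetchone()
print(field_labels["user_date_1"], "=", row[0])
```

Note that nothing stops one site from storing a purchase date in `user_date_1` and another from storing a renewal date there; the label mapping is cosmetic, not a data model.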
Suppose you have created an inventory-tracking and warehouse management system. Each operation on an item, such as adding it to the inventory and storing it in the warehouse, is represented by a discrete transaction. Your customer, while pleased with the core functionality of the application, has defined additional data they want associated with certain transactions, and they want to store them in the same database as that used by your application to simplify various operational tasks, such as backing up the system. As with every other request, your customer wants these data added to the system now. They don't want to wait until the next release!
Hook tables are one way to solve this problem. They give your customer a way of extending the persistent storage mechanism that can be preserved over upgrades. The proper use of hook tables requires coordination among multiple layers in your architecture, so be careful with them.
Begin creating hook tables by identifying those aspects of the schema the customer wishes to extend. Next identify the events that are associated with the most basic operations on these data: create, update, and delete. Include operations that are initiated or handled by any layer in your architecture, including stored procedures. Take care, as you need to identify every such operation.
Now, create the hook tables. A hook table's primary key is equivalent to the primary key of a table in your current schema; the rest of the table is left empty so that customers can add their own columns. The primary key should be generated by your application using a GUID or an MD5 hash of a GUID. Avoid using auto-increment fields, such as automatically incremented integers, for this key, as such fields make database upgrading difficult.
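The key-generation rule can be sketched as follows; this is an illustrative example using SQLite from Python, and the table names (`inv_txn`, `inv_txn_hook`) and the customer's lot-number column are hypothetical. The application generates a GUID key and inserts it into both the main table and the hook table, while the customer extends the hook table on their own.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
# Main application table, keyed by an application-generated GUID rather
# than an auto-increment integer, so keys survive database upgrades.
conn.execute("CREATE TABLE inv_txn (txn_id TEXT PRIMARY KEY, item TEXT)")

# Hook table: as shipped, it contains only the shared primary key.
conn.execute("CREATE TABLE inv_txn_hook (txn_id TEXT PRIMARY KEY)")

# The customer adds their own columns to the hook table.
conn.execute("ALTER TABLE inv_txn_hook ADD COLUMN cust_lot_no TEXT")

txn_id = str(uuid.uuid4())           # GUID generated by the application
conn.execute("INSERT INTO inv_txn VALUES (?, ?)", (txn_id, "widget"))
conn.execute("INSERT INTO inv_txn_hook (txn_id) VALUES (?)", (txn_id,))
```

Because the GUID is generated by the application rather than the database engine, upgrade scripts can copy both tables to a new schema without renumbering keys.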
Create, update, and delete modifications to the primary table are captured as events, and a registration or plug-in architecture is created to allow customers to write code that responds to them. Thus, when some action results in a new record being added to your database, a notification event is received by customer code. Upon receiving this notification, your customer can perform whatever processing they deem appropriate: creating, updating, and/or deleting the data they have added to the schema under their control.
In general, event notification comes after all of the work has been done by your application. In a typical creation sequence using a plug-in architecture, your application performs all of the work necessary to create a record and does the insertion in both the main application table (with all of its associated data) and the hook table. The hook table insertion is easy, as all it contains is the primary key of the main table. The list of plug-ins associated with the hook table is called, with the newly created primary key passed as a parameter.
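The creation sequence just described can be sketched in code. This is a simplified, hypothetical plug-in mechanism (an in-process callback list standing in for a real registration architecture), again using SQLite from Python with invented table names. The application does all its own work first, then notifies each registered plug-in with the newly created primary key.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inv_txn (txn_id TEXT PRIMARY KEY, item TEXT)")
conn.execute("CREATE TABLE inv_txn_hook (txn_id TEXT PRIMARY KEY, cust_note TEXT)")

# Customer-written plug-ins register here; each receives the new key.
plugins = []

def register(fn):
    plugins.append(fn)
    return fn

@register
def record_inspection(txn_id):
    # Customer code: fill in the customer's own hook-table columns.
    conn.execute(
        "UPDATE inv_txn_hook SET cust_note = 'inspected' WHERE txn_id = ?",
        (txn_id,),
    )

def create_transaction(item):
    txn_id = str(uuid.uuid4())
    # The application performs all of its own work first: the main
    # table insert and the trivial hook-table insert (just the key)...
    conn.execute("INSERT INTO inv_txn VALUES (?, ?)", (txn_id, item))
    conn.execute("INSERT INTO inv_txn_hook (txn_id) VALUES (?)", (txn_id,))
    # ...then notifies the registered plug-ins, passing the new key.
    for plugin in plugins:
        plugin(txn_id)
    return txn_id

tid = create_transaction("widget")
```

In a shipping product the plug-in list would be populated from configuration or a registration API rather than a decorator, but the ordering is the point: notification comes only after the application's own inserts have committed.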
More sophisticated structures allow pre- and postprocessing transactions to be associated with the hook table to an arbitrary depth. Preprocessing can be important when a customer wishes to perform work on data that is about to be deleted and is commonly required when you're coordinating transactions that span multiple databases.
Hook tables are not designed to work in every possible situation you may encounter, because they have a number of limitations. Relational integrity can be difficult to enforce, especially if your customer extends the hook table in surprising ways. Because hook tables might also introduce unacceptable delays in transaction-processing systems, they should be kept small. You also have to modify your upgrade scripts so that they are aware of the hook tables.
Spreadsheet Pivot Tables
Quite often a customer asks for dynamic reporting capabilities that may be difficult to support in your current architecture. Before trying to create such capabilities, see if your customer is using or has access to any of the powerful spreadsheet programs that provide interactive data analysis. I've had good results with Microsoft Excel, so I'll use it as my example.
Excel provides a feature called a pivot table that allows hierarchically structured data to be dynamically manipulated by users. With a pivot table, users can quickly sort, summarize, subtotal, and otherwise "play" with the data, and they can arrange it in a variety of formats. My experience is that once introduced to pivot tables, users quickly learn to manipulate them to suit their own needs.
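To make the idea concrete, the core computation a pivot table performs is a cross-tabulation: one field supplies the rows, another the columns, and an aggregate (here, a sum) fills the cells. The following is a rough standard-library analogue in Python; the regions, products, and amounts are invented sample data, not anything from a real extract.

```python
from collections import defaultdict

# Extract of application data: (region, product, amount) rows.
rows = [
    ("East", "widget", 100), ("East", "gadget", 250),
    ("West", "widget", 300), ("East", "widget", 50),
]

# Cross-tabulate: regions as pivot rows, products as pivot columns,
# summed amounts in the cells.
pivot = defaultdict(lambda: defaultdict(int))
for region, product, amount in rows:
    pivot[region][product] += amount

print(dict(pivot["East"]))  # {'widget': 150, 'gadget': 250}
```

What Excel adds on top of this computation is the interactivity: users can drag fields between rows and columns and change the aggregate without writing any code, which is why handing them an extract is often cheaper than building reporting features yourself.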
Pivot tables are based on extracts of your application data. Once these data have been exported, they are no longer under your control. Security, privacy, and accuracy are thus just some of the concerns that accompany any use of extracted data, and pivot tables are no exception. Pivot tables are often associated with ETL scripts, discussed in the next section.
Extract, Transform, and Load Scripts
Extract, transform, and load (ETL) scripts refer to a variety of utility programs designed to make it easier to manipulate structured data stored within databases. Extract scripts read data from one or more sources, extracting a desired subset and storing it in a suitable intermediate format. Transform scripts apply one or more transformations on the extracted data, doing everything from converting the data to a standard format to combining them with other data to produce new results. Finally, load scripts take the results of the transform scripts and write them to a target, usually another database optimized for a different purpose than the source database.
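The three stages can be sketched end to end. This is a deliberately small, hypothetical example using SQLite from Python; the table names, the filter threshold, and the cents-to-dollars conversion are all invented for illustration. The extract pulls a subset from the source, the transform converts units and aggregates, and the load writes the results to a target database shaped for reporting.

```python
import sqlite3

# Source: the operational database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE txn (id INTEGER, item TEXT, amt_cents INTEGER)")
src.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(1, "widget", 125000), (2, "gadget", 50), (3, "widget", 80000)],
)

# Extract: read the desired subset into an intermediate form.
extracted = src.execute(
    "SELECT item, amt_cents FROM txn WHERE amt_cents > 100"
).fetchall()

# Transform: convert to a standard unit (dollars) and aggregate by item.
totals = {}
for item, cents in extracted:
    totals[item] = totals.get(item, 0.0) + cents / 100.0

# Load: write the results to a target optimized for reporting.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE item_totals (item TEXT PRIMARY KEY, total_dollars REAL)")
dst.executemany("INSERT INTO item_totals VALUES (?, ?)", totals.items())
```

A productized version of such a script would parameterize the source and target connections and the filter criteria, and would be tested and documented like any other shipped component, which is exactly the marketectural advantage discussed below.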
If your application is in use long enough, chances are good that customers are going to want to extract and/or load data directly to the schema, bypassing the domain layer. There are several reasons to do this, including the fact that the programmatic model provided by the API is likely to be too inefficient to manage transformations on large data sets. A special case of ETL scripts concerns upgrades, when you need to migrate from one version of the persistent storage model to another. Although it may be tempting to let your customers do this work, there are distinct tarchitectural and marketectural advantages to doing it yourself.
From a tarchitectural perspective, providing specifically tuned ETL scripts helps ensure that your customers obtain the right data, that transformations are performed appropriately, and that load operations don't break and/or violate the existing schema structure. An alert marketect can also take advantage of productized ETL scripts. As part of the released product, these scripts will be tested and documented, considerably enhancing the overall product value.
Tell Them What's Going On
Even though I generally recommend against giving your customers direct access to your schema, I also recommend that you provide them with complete information about it so that they can gain a proper understanding of your system's operation. This means technical publications that detail the data dictionary; data, table, and naming conventions; important semantics regarding the values of any special columns; and so forth. Such documents become invaluable, usually motivating customers to work within the guidelines you have established for system integration. The lack of usable, accurate technical documentation is the real source of many integration problems.