Making the Case For (and Against) Real-Time Data

Real-time data sounds so cool. Who wouldn't want data that's as fresh and current as possible? And surely we, as professionals, want all enterprise data to be conformed and consistent? What better way than to populate the DW/BI system in real time?

As we discuss in this section, it may sound cool, but it's not so easy.

What Makes Delivering Real-Time Data Hard?

The hardest things about delivering real-time data are, as usual, related to people rather than to technology. Nonetheless, there are some technical challenges, too, and Microsoft offers many interesting features to help you with them.

First, the people-based challenges. One of the biggest issues for the DW/BI team is meeting the demands both of those who want real-time data and of those who most decidedly do not. You may think that pushing latency closer to real time would be a win for everyone. Surely, in the absence of any cost of delivering data, we'd all prefer real-time data?

Actually, no. Anyone who's tried to develop a non-trivial analysis knows that you need to work on a static dataset, where the numbers aren't changing from moment to moment and query to query. If you force these analysts to work against a dynamic database, they'll copy a bunch of data to a personal computer: exactly the kind of behavior you're probably hoping to stem with your DW/BI project.

A second people-related problem has to do with the DW/BI team itself. If you add operational duties to this team's charter, you risk having the urgent overwhelm the important. In other words, real-time operations will co-opt the strategic nature of the DW/BI system. Think about it: this is why strategic groups generally have no operational responsibility.

Now, on to the technical problems.

If you're going to put real-time data in the DW/BI system, you need to use the same conformed dimensions, and the same surrogate keys, used elsewhere in the DW/BI system. Otherwise, there's really no point. You'll need to process new dimension members, and updates to existing dimension members, in real time. The real-time (and non-real-time) facts that flow in must swap their business keys for the correct surrogate keys.

This is more of a design challenge than a computational one. Assuming you receive only those dimension rows that are changing, the ETL for hourly processing should take approximately 1/24 the time of the ETL for daily processing. The challenge is in the design.
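The key-substitution step described above can be sketched in a language-agnostic way. This Python fragment is a hypothetical illustration (not the book's ETL, and the names `customer_bk`, `customer_sk`, and `amount` are invented): facts arriving in real time swap their business key for the current surrogate key, and an early-arriving fact for an unseen member gets a freshly assigned key so the load isn't blocked.

```python
# Sketch: real-time surrogate key substitution (hypothetical; in practice
# the key map is a lookup cache maintained by the dimension pipeline).

def make_key_pipeline():
    key_map = {}      # business key -> current surrogate key
    next_key = [1]    # surrogate key sequence

    def process_fact(fact):
        """Replace the fact's business key with the current surrogate key,
        creating a placeholder dimension member on the fly if necessary."""
        bk = fact["customer_bk"]
        if bk not in key_map:
            # Early-arriving fact: assign a surrogate key now; the full
            # dimension row can be fleshed out when the dimension data lands.
            key_map[bk] = next_key[0]
            next_key[0] += 1
        return {"customer_sk": key_map[bk], "amount": fact["amount"]}

    return process_fact, key_map

process_fact, key_map = make_key_pipeline()
fact1 = process_fact({"customer_bk": "C-1001", "amount": 25.0})
fact2 = process_fact({"customer_bk": "C-1001", "amount": 10.0})
# Both facts for the same customer map to the same surrogate key.
print(fact1, fact2)
```

The same logic applies whether the lookup runs in an Integration Services data flow or in set-based SQL; the design question is how the cache stays current as dimension changes stream in.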

Most DW/BI systems that are updated daily or monthly track dimension changes on a daily basis. In other words, you'll combine potentially many database transactions on a customer account record into a single end-of-day image, and you associate all sales transactions that occurred yesterday with that end-of-day image. This isn't strictly accurate, but business users almost always prefer this structure, with some exceptions, especially for online businesses. Daily changes are also easier to manage, especially if some of the changing attributes track history as a Type 2 Slowly Changing Dimension.

But if you're consuming and processing dimension change information intraday, you really don't have this option. You need to date and timestamp all Type 2 changes to a dimension member, and you may end up adding several rows for a customer account during the day. This may or may not be what the business users really want.
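Intraday Type 2 processing means that each change closes the member's current row with a timestamp and opens a new one. A minimal sketch of that bookkeeping, with hypothetical column names (`row_start`, `row_end`, `segment`):

```python
from datetime import datetime

def apply_type2_change(history, business_key, new_attrs, change_time):
    """Close the member's current row and append a new current row.
    history is a list of dicts; a row is current while row_end is None."""
    for row in history:
        if row["bk"] == business_key and row["row_end"] is None:
            row["row_end"] = change_time      # close the old version
    history.append({
        "bk": business_key,
        **new_attrs,
        "row_start": change_time,
        "row_end": None,                      # None marks the current row
    })

history = [{"bk": "C-1001", "segment": "Retail",
            "row_start": datetime(2006, 1, 1), "row_end": None}]
# Two changes land for the same customer within a single day:
apply_type2_change(history, "C-1001", {"segment": "Premium"},
                   datetime(2006, 6, 15, 10, 30))
apply_type2_change(history, "C-1001", {"segment": "VIP"},
                   datetime(2006, 6, 15, 14, 5))
print(len(history))  # three versions of one customer, two created intraday
```

Each intraday version is a legitimate Type 2 row, which is exactly why business users may see more dimension rows than they expect.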

This Type 2 problem is annoying, but it's not as bad as the Type 1 problem. An update to a Type 1 dimension attribute requires updating the dimension member's row, potentially multiple times during the day. The update itself is mildly problematic: we don't like lots of updates to tables. The worse problem, however, lies with any aggregations (relational aggregate tables, indexed views, or Analysis Services aggregations) that are built on the Type 1 attribute. All of these aggregations need to be adjusted, not just for the data that's flowing in today, but for all time.

Warning 

Any precomputed aggregations defined on a Type 1 attribute need to be adjusted (or rebuilt) if the attribute changes. This adjustment affects the entire time series of data. It affects even the fact tables that you're not updating in real time: any fact table that uses that dimension and has aggregations defined on the Type 1 attribute. The cost of recomputing (or fixing) the aggregations may be bearable for daily processing. It's almost certainly too expensive for data latency of less than a minute; the system would constantly be recomputing aggregations.

With large data volumes, don't define any aggregations on a Type 1 attribute that's updated in real time, neither for the real-time fact table nor for any other fact table that uses that dimension. This problem is not unique to Microsoft; it's a fundamental issue with real-time data.
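The cost asymmetry is easy to see in miniature. In this hypothetical sketch (invented attribute `region`), a single Type 1 update to one dimension member invalidates an aggregate across every year of history, not just today's data:

```python
from collections import defaultdict

def build_aggregate(facts, dim):
    """Precompute sales totals by the customer's (Type 1) region attribute."""
    agg = defaultdict(float)
    for f in facts:
        agg[dim[f["customer_sk"]]["region"]] += f["amount"]
    return dict(agg)

dim = {1: {"region": "West"}, 2: {"region": "East"}}
facts = [  # spans years of history, not just today's load
    {"customer_sk": 1, "amount": 100.0, "year": 2004},
    {"customer_sk": 1, "amount": 50.0,  "year": 2006},
    {"customer_sk": 2, "amount": 75.0,  "year": 2005},
]

before = build_aggregate(facts, dim)
dim[1]["region"] = "East"   # one Type 1 update to one customer...
after = build_aggregate(facts, dim)
print(before)               # {'West': 150.0, 'East': 75.0}
print(after)                # {'East': 225.0}: all history changed buckets
```

Rebuilding a toy dictionary is free; rebuilding years of aggregate tables, indexed views, or Analysis Services aggregations every few seconds is not, which is the heart of the warning above.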

In general, we've found that real-time DW/BI systems are harder, more costly, and more time consuming to build. Hence, adding a real-time element to your DW/BI project will greatly increase its risk of failure. We certainly don't recommend including real-time data in a Phase 1 project.

What Makes Real-Time Data Valuable?

Real-time data is valuable if it helps you make an important decision. What type of decision is based on something that happened seconds, minutes, or even hours ago? An operational decision. A CEO might want a whiz-bang real-time monitor on her desktop, but any CEO who's running a company based on what happened five minutes ago should be replaced as soon as possible (unless it's a very small company where the CEO also works the cash register!). Ditto for any VPs and their analytic support staff. Even the VP of Operations focuses on tactics (if not strategy), rather than what's happening right now. That's what line managers are for.

Note 

The exceptions are obvious: major disasters or fundamental changes in the business landscape, like an earthquake, a chemical spill, or a competitor's legal troubles. A CEO is pretty unlikely to learn about such events from a real-time BI dashboard, don't you think?

It's worth repeating this point: Real-time data is intended for operational staff and their direct managers. It's intended for operations. In the vast majority of cases, an operational system (which is, after all, designed to support an operational process) should deliver the real-time data.

We're not saying that operational staff, and the decisions they make, aren't important. Any one decision is small, with a small impact on the business, but in aggregate, operational decisions are hugely important. Someone should help these people out. Maybe that someone is the DW/BI team, which has expertise with business analysis and the query and analysis toolset. Maybe not.

What Should You Do?

The business has a problem: it needs better, more flexible access to real-time data. Business users often look at the DW/BI tools and ask for that same level of functionality and flexibility on operational systems.

Often, the best solution is to improve the operational systems. If your operational system is purchased (as most are), your company could extend the set of reports that ship with the product. If those reports can't be modified, you should think about whether you made a great product selection. But in the meantime, you could certainly replace or extend the packaged reports with a Reporting Services portal. That portal may even be integrated with the DW/BI portal.

The problem is more interesting if the need for real-time data spans several operational data sources. In this case, you must perform significant data transformation and integration. For really simple scenarios, where the data is perfectly clean (what alternate universe would that be?), both Analysis Services and Reporting Services can span multiple data sources. But realistically, any integration that's not trivial will require that the data flow through Integration Services. From there it can populate a relational database, or even flow directly into Analysis Services or Reporting Services.

If you attempt to integrate the real-time data directly into the relational data warehouse database, you run into the dimension change issues that we discussed earlier in this chapter. You have two choices:

  • Integrate the real-time data into the DW/BI system, using and updating the surrogate keys. This approach can disrupt the user experience for the part of your user community that doesn't use real-time data. It can also be extremely difficult and problematic to manage.

  • Keep the real-time data separate, and access it using only business keys. This approach can lead to multiple versions of the truth.

We discuss these approaches later in this chapter. Although the first approach is more intellectually appealing, we don't think it's very practical for most systems. Segregating the real-time data is a more common approach.

We realize we haven't answered the main question: What should you do? Of course, the answer depends on your requirements and environment. Table 17.1 collects some of the approaches we think are most practical and ranks them on key criteria. These criteria are:

  • How easy is the approach to implement?

  • What kind of latency can the approach deliver? Zero latency (++ in the table) is delivered only by direct queries against the transaction system.

  • How well does the approach perform at large scale (data volumes or users)?

  • How well does the approach support ad hoc access?

  • How well does the approach support access to data thats been transformed and integrated?

  • Does the approach meet our standards for consistent reporting across the enterprise (a single version of the truth)? No approach scores better than a zero on this criterion, because of the difficulties inherent in real-time data and analysis in a company that also has an integrated DW/BI system.

Table 17.1: Methods for Delivering Real-Time Data

Source directly from the transaction system:

  • Create Reporting Services reports directly on the transaction database. By default, reports are executed on demand and contain the most recent data.

  • Create Reporting Services reports directly on the transaction database. Set up the reports either to cache or to snapshot on a schedule of your choice.

  • Create an Analysis Services database directly on the transaction system. Use proactive caching settings.

  • Use Reporting Services Report Builder directly on the transaction database.

Populate a relational data warehouse database in real time:

  • Use Reporting Services reports on the relational data warehouse.

  • Use Reporting Services Report Builder on the relational data warehouse.

  • Create an Analysis Services database that you update in real time.

Perform integration, but skip the relational data warehouse database:

  • Write an Integration Services package that integrates and transforms data. Set up a report that sources from that package.

  • Write an Integration Services package that integrates and transforms data. Use that package to populate Analysis Services objects directly.
In Table 17.1, we ranked each criterion on a scale of -2 (noted as - -) to +2 (noted as + +). An ease score of + + means the method is the easiest approach for delivering real-time data. No method gets positive marks across all the criteria.

The remainder of this chapter describes the technical features in the SQL Server 2005 toolset that enable business users to access data in real time (latency of less than one day).



The Microsoft Data Warehouse Toolkit: With SQL Server 2005 and the Microsoft Business Intelligence Toolset
ISBN: B000YIVXC2
Year: 2006
Pages: 125
