Assume you own a small company that sells a key set of products which are essentially static in nature; the base list of products just doesn't change. New products may be added to the list, but the original set of products remains the same. In this scenario, several of your products are sold each day, and the sales data arrives at your data warehouse sometime after normal business hours. Further, your company is headquartered in the United States. The business analysts on your team want to see sales data no later than the next working day following the actual sale. Assume incremental processing of your cube takes a relatively small amount of time (just 1-2 hours), which can be completed before the start of the next business day. It is also assumed that data updates (information about new products added into the system) in your relational databases arrive within a reasonable time.
The traditional approach to solving the outlined scenario would be to use MOLAP dimension storage and to do an incremental update of the dimensions after the relational data update is completed. This approach is computation intensive and a fairly costly operation. Following the dimension data update, an incremental process of the relevant measure groups is required; once that completes, the consumers of the cube can browse the results. This approach has advantages. Indeed, it is a good fit whenever your relational data updates occur regularly at a specific time interval and you have sufficient time to update the cubes and make them available to the appropriate users. Several existing Analysis Services users in the retail space use this solution: data typically arrives during the night, and the cubes are processed nightly for use the next business day.
As with the traditional approach, you can do an incremental process of dimensions and measure groups. Or, for the sake of completeness and given the time, you could even do a full process of the entire cube. Again, these operations typically take place during the night, so time is not often a constraint. You could use SQL Server 2005 Integration Services to create a package to do this job, as seen in Chapter 15. Alternatively, you can use the proactive caching feature. There are two basic methods (with multiple variations) within proactive caching that can be used to initiate data updates: the query-based method and the time-based method. The method you choose will depend on your needs.
One solution for the long latency scenario is to use the proactive caching feature in Analysis Services 2005. In the proactive caching solution, you set proactive caching to kick in as soon as the data changes by using the scheduled MOLAP option. For the scheduled MOLAP option, you need to specify a query that is run at scheduled time intervals to determine whether there has been a change to the source data. Here is how it works: the first time Analysis Services sends the specified query to the relational data source, it collects and stores the response. That stored response provides a baseline against which subsequent query results can be compared. When a subsequent query returns a result set that does not match the baseline, it is presumed there has been a data update, and proactive caching starts the process of incremental update. Depending on the other proactive caching settings, such as latency, the cache will be updated. The latency setting is a property associated with proactive caching; specifically, it tells Analysis Services how long to wait between cache updates. This is what provides that real-time feeling to the end user.
Figure 18-10 shows the proactive caching option where you specify a polling query that detects changes in the source data. This query could be as simple as a count of rows in the relational table or as complex as a hash value of the entire result set. For the long latency scenario, you would need to check the Enable incremental updates option so that dimensions and partitions are processed incrementally with only the data that has been added. If this option is enabled, Analysis Services processes the dimension or partition object by sending a Process Add statement. If you do not enable this option, Analysis Services automatically sends a Process Update statement to the dimension or the cube partitions. Process Update on dimensions can be expensive, depending on the size of the dimensions and on the partitions and aggregations built for the partitions. For the tradeoffs between the two processing options (Process Update versus Process Add), please refer to the performance chapter (Chapter 13) in this book. After specifying the polling query, you need to specify the processing query that retrieves the appropriate data from the relational data source for processing.
Here is a handy proactive caching technique that can be applied to dimensions to optimize your incremental processing query. First, you specify the polling query. The results of the polling query can then be used as parameters to the incremental processing query. For example, suppose your polling query is SELECT max(product_id) FROM Products and it initially returns 100. Then 50 products are added, so when the polling query is subsequently run, it returns 150. These two values can then be used as parameters to the incremental processing query:
SELECT * FROM Products WHERE product_id > COALESCE(?, 0) AND product_id <= COALESCE(?, -1)
In this way, the processing query returns only those rows that were added since the last data change. This technique can be a real timesaver if your Products table is the size of, say, Wal-Mart's or Amazon's.
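To see the parameterized query end to end, here is a small sketch using SQLite in Python; the table contents mirror the hypothetical 100-then-150 product example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (product_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO Products VALUES (?, ?)",
                 [(i, f"product {i}") for i in range(1, 151)])

# Results of the polling query before and after the data change.
old_max, new_max = 100, 150

# The incremental processing query: COALESCE guards against NULL
# parameters on the very first run, when no baseline exists yet.
processing_query = """
    SELECT * FROM Products
    WHERE product_id > COALESCE(?, 0)
      AND product_id <= COALESCE(?, -1)
"""

new_rows = conn.execute(processing_query, (old_max, new_max)).fetchall()
print(len(new_rows))  # 50 -- only the rows added since the last poll
```

Note the asymmetry of the COALESCE defaults: if both parameters are NULL on the first run, the predicate becomes product_id > 0 AND product_id <= -1, so the query deliberately returns no rows rather than the whole table.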
The second method of proactive caching is to have the dimension and partition data updated periodically. While this approach could hardly be considered sophisticated, there is no doubt it gets the job done, and it doesn't take much in the way of setup. Here is how it works: you set proactive caching to pick up any new source data and rebuild itself (the cache) at a predetermined interval. For example, if you want the update to run "24 hours since last process," you set a proactive caching property that ensures the MOLAP cache is rebuilt every 24 hours. In the long latency scenario you would typically not set the latency property, since you want the new data to be available as soon as the MOLAP cache is rebuilt. You specify when to rebuild the cache using the "Update the cache periodically" option, as shown in Figure 18-11. This option ensures that the MOLAP cache is rebuilt every 24 hours. However, you should be aware that each cache update occurs 24 hours after the previous update completes. For example, if processing started at 12 midnight on the first day and it took 30 minutes for the cache to be updated, then on the second day the cache update will start at 12:30 a.m. instead of 12 midnight. It would be convenient if the cache update happened at the same time each day; perhaps a future release will provide this. However, you can implement this functionality using SQL Server Integration Services, as seen in Chapter 16. You might have to reset the proactive caching property periodically to keep it aligned with your business needs so that the most up-to-date data is available for your end users. The configuration you set up for updating the cache periodically is also referred to as scheduled MOLAP, since the cache update is scheduled. If you click OK in the screen shown in Figure 18-11, you will be in the partition's properties pane, where you will see the Scheduled MOLAP option selected, as shown in Figure 18-12.
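The schedule drift described above is easy to see in a short sketch; the midnight start time and 30-minute processing window are the illustrative numbers from the example, and this is a simulation of the behavior, not an Analysis Services API:

```python
from datetime import datetime, timedelta

def rebuild_start_times(first_start, processing_minutes, days):
    """Successive cache rebuild start times when each rebuild is
    scheduled 24 hours after the completion of the previous one,
    which is how the "Update the cache periodically" option behaves."""
    starts, start = [], first_start
    for _ in range(days):
        starts.append(start)
        # Next rebuild: 24 hours after this rebuild finishes.
        start += timedelta(minutes=processing_minutes) + timedelta(hours=24)
    return starts

# Midnight start, 30-minute processing time, three days.
for s in rebuild_start_times(datetime(2005, 1, 1, 0, 0), 30, 3):
    print(s)
# The start time slips by the 30-minute processing window each day:
# Jan 1 at 00:00, Jan 2 at 00:30, Jan 3 at 01:00.
```

This is the drift an Integration Services package scheduled at a fixed clock time avoids, at the cost of managing the processing schedule outside of proactive caching.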