Aggregation by Itself


Now that all your options have been clearly spelled out, the reality is that there is only one choice. Do not treat the aggregation process as part of the fact table data loading process (that's exactly why they're separate chapters in this book). Simply treat aggregation as its own distinct phase in the overall process (i.e., data load, then aggregate), as shown in Figure 7-4.

Figure 7-4. Simply Treat Aggregation Process by Itself

The astute observer will recognize that this was exactly what was being shown by a portion of the second option in both Figures 7-2 and 7-3. Look back and you'll see that by simply removing the data loading portion of the second option in both of these previous figures, all you're left with is exactly what's shown in Figure 7-4. This is, hands-down, the absolute best way to go.

The architecture separates the base fact and aggregate load logic, so you don't have the project management nightmares related to having a single point of failure. In fact, you can implement each aggregation on its own, which is a natural fit for implementing via materialized views with refreshes (assuming that the aggregation processes will perform only inserts, as refresh does not yet support the MERGE command). This option also has the added benefit of subdividing the tasks into more manageable and less complex pieces for developer resource allocation and batch job scheduling.

However, there are two drawbacks to this option. First, you'll have to read the base fact table information at least twice (i.e., once to load it from the data files and a second time to summarize it). In fact, if you implement each of the aggregation processes separately, you'll have to read the base fact table information N+1 times (where N represents the distinct number of aggregates based on that base fact table). Second, this option introduces an inter-process wait, since you cannot load any of the aggregates until the base fact table has completed its data load.

A reasonable question that people often ask is: Why can't some of the aggregates be based off other aggregates with this approach? In other words, why can't the monthly aggregate be summarized from the weekly aggregate? In theory, it could. But recall the example where the week aggregate was found to be essentially useless? You'd have to rewrite the dependent aggregation process if the aggregate it's based on changed or became invalid. So why introduce unnecessary dependencies? Plus, you'd be introducing an inter-process wait state per dependency (e.g., you could not aggregate for month until week was complete). Always summarize from the base fact.

Mapping the optimally efficient aggregation processing options for various Oracle versions and summarization requirements yields the three scenarios shown in Figure 7-5.

Figure 7-5. Efficient Aggregation Implementation Options

To provide context for the following descriptions of these three aggregation scenarios, refer once again to this book's simple data warehousing data model shown in Figure 7-6.

Figure 7-6. Aggregates in the Dimensional Entity Relationship Diagram

Scenario 1

If you only need inserts (i.e., not upserts), then you have two fact table implementation options with this scenario: aggregate tables or materialized views.

If you're using simple aggregate tables (i.e., not materialized views), then your best and most obvious choice is to merely use a simple, parallel, direct mode load insert. There is absolutely no need to write procedural logic such as Pro-C or PL/SQL code. The parallel insert with direct mode load shown below will smoke any cursor-based Pro-C or PL/SQL alternative:

alter session enable parallel dml;

insert /*+ parallel(aggr,10) append */ into pos_week aggr
select /*+ parallel(fact,10) full(fact) */
       $WEEK_ID, location_id, product_id,
       sum(nvl(sales_unit,0)),
       sum(nvl(sales_retail,0)),
       sum(nvl(gross_profit,0))
from   pos_day fact
where  period_id between $BEG_ID and $END_ID
group  by $WEEK_ID, location_id, product_id;

commit;

How does this work? The insert select command expects three key pieces of information passed in as parameters: the beginning and ending period IDs for the days represented by the target week and the period ID for the target week itself. The select portion of the command performs group operations on the selected rows to summarize the correct aggregate data. This selection is done in parallel as indicated by the select parallel hint. Then, the insert portion of the command inserts that data in direct load mode (i.e., no logging) as indicated by the insert append hint. This too is done in parallel.

In the example above, note that Oracle will fork 31 processes to perform this command: a coordinator process (the controlling parent for other processes), 10 select sub-processes, 10 sort sub-processes, and 10 insert sub-processes. Don't forget, the inserts will utilize direct mode loads. That's one heck of a lot of work being done in parallel and with great efficiency, and all controlled by a few simple Oracle DML and query hints.
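If you want to verify that behavior for yourself, you can peek at the parallel execution views while the statement is running. Here is a minimal sketch (it assumes you have access to the V$ views; the column alias is arbitrary):

select qcsid, count(*) slave_count
from   v$px_session
where  sid <> qcsid          -- count only the slaves, not the query coordinator
group  by qcsid;

Each row shows a query coordinator session and the number of parallel slaves currently working on its behalf.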

Of course, you'll probably have multiple target weeks that need to be aggregated. So, you could enclose the code above in a script that loops through those weeks and calls the code once per week. You could even submit those invocations in the background such that all of them could be running at once. Just make sure not to go overboard on your total parallel processing load.
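If you'd rather keep that driver logic inside the database than in a shell script, a bare-bones PL/SQL loop along the following lines will do the job serially, one week per pass (the BEG_DAY_ID and END_DAY_ID columns on PERIOD are assumed here purely for illustration; to run weeks concurrently you would still submit one call per week as separate background jobs):

alter session enable parallel dml;

begin
  -- beg_day_id/end_day_id are assumed helper columns on PERIOD giving the
  -- first and last day-level period_id of each week; adjust to your dimension
  for wk in (select period_id, beg_day_id, end_day_id
             from   period
             where  levelx = 'WEEK'
             order  by period_id) loop
    insert /*+ parallel(aggr,10) append */ into pos_week aggr
      select /*+ parallel(fact,10) full(fact) */
             wk.period_id, location_id, product_id,
             sum(nvl(sales_unit,0)),
             sum(nvl(sales_retail,0)),
             sum(nvl(gross_profit,0))
      from   pos_day fact
      where  period_id between wk.beg_day_id and wk.end_day_id
      group  by location_id, product_id;
    commit;  -- a direct-path (APPEND) insert must be committed before the next one
  end loop;
end;
/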

Two simple rules for selecting the optimal parallel DML degree for your hardware are that the total number of parallel processes should equal:

  • No. of CPUs when CPU Bandwidth >= Disk I/O Bandwidth

  • 2–4 * No. of CPUs when Disk I/O Bandwidth >> CPU Bandwidth

For example, at 7-Eleven, we had 16 CPUs and an EMC disk array with 4 GB of cache using RAID 0+1, so a total parallel process load of 32–64 concurrent processes was both sustainable and optimal (depending on the DML operations). In other words, our disk array could handle much more volume than our CPUs could generate, so we could increase our parallel degree beyond our CPU count until the disk I/O peaked.

To give you an idea of just how efficient this approach is, at 7-Eleven, we could recreate from scratch an entire 500-million-row aggregate in about 20 minutes!

Of course, Scenario 1 is also a perfect fit for materialized views. The solution would be as follows:

create materialized view mv_pos_week
  parallel (degree 1) nologging
  BUILD IMMEDIATE
  REFRESH FAST
  ENABLE QUERY REWRITE
as
select /*+ parallel(pos_day,1) full(pos_day) */
       period.wk_id period_id, location_id, product_id,
       sum(nvl(sales_unit,0))   sales_unit,
       sum(nvl(sales_retail,0)) sales_retail,
       sum(nvl(gross_profit,0)) gross_profit
from   pos_day,
       (select a.period_id wk_id, b.period_id d1_id, c.period_id d2_id
        from   period a,
               period b,
               period c
        where  a.levelx = 'WEEK'
          and  b.levelx = 'DAY'
          and  c.levelx = 'DAY'
          and  a.period_date     = b.period_date
          and  a.period_date + 6 = c.period_date
          and  exists (select 1
                       from   pos_day
                       where  period_id between b.period_id and c.period_id)) period
where  period_id between period.d1_id and period.d2_id
group  by period.wk_id, location_id, product_id;

This code creates a fully populated aggregate implemented as a materialized view, with on-demand fast refreshes. That's all there is to it! Of course, you still need to create all the fact table bitmap indexes and statistics on the materialized view so that queries will use the star transformation explain plan. But otherwise, it's just like any other fact table.
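As a rough sketch of those follow-up steps (the index names are illustrative, and the fast refresh assumes a materialized view log exists on POS_DAY and that the view qualifies for fast refresh):

create bitmap index mv_pos_week_per_bx  on mv_pos_week (period_id)   nologging;
create bitmap index mv_pos_week_loc_bx  on mv_pos_week (location_id) nologging;
create bitmap index mv_pos_week_prod_bx on mv_pos_week (product_id)  nologging;

begin
  -- gather optimizer statistics on the materialized view and its bitmap indexes
  dbms_stats.gather_table_stats(ownname => user,
                                tabname => 'MV_POS_WEEK',
                                cascade => true);

  -- on-demand fast (incremental) refresh after each base fact load
  dbms_mview.refresh(list => 'MV_POS_WEEK', method => 'F');
end;
/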

Scenario 2

If you need upserts and you're using Oracle 8i, then your only choice is, unfortunately, to write procedural logic such as PL/SQL or Pro-C code. For many cases, Pro-C will be the superior choice in terms of raw performance.

Why Pro-C instead of just using PL/SQL? This is a fair question. PL/SQL is a great language for database-internal programming, and that's exactly what we have here with what is essentially a "table to table" copy. Furthermore, SQL*Plus is a lightweight command-line program that we could easily embed within UNIX shell scripts to execute PL/SQL code. So again, why not just use PL/SQL?

Remember that my goal is to show you the most optimally efficient implementation, which is Pro-C. For many shops, the answer will be to go with PL/SQL. The rationale is often that their developers are more comfortable with PL/SQL (at least more so than with Pro-C and its associated makefiles). Plus, some UNIX vendors no longer provide a free C compiler (and Oracle currently only supports the GNU C compiler on Linux). If that describes your shop, then by all means, stick with PL/SQL.
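For those shops, a minimal PL/SQL version of the week upsert might look like the following sketch (it reuses the earlier column names and substitutes SQL*Plus variables for the $WEEK_ID/$BEG_ID/$END_ID parameters; it is illustrative, not the tuned Pro-C program):

declare
  cursor c1 is
    select /*+ parallel(fact,10) full(fact) */
           &WEEK_ID period_id, location_id, product_id,
           sum(nvl(sales_unit,0))   sales_unit,
           sum(nvl(sales_retail,0)) sales_retail,
           sum(nvl(gross_profit,0)) gross_profit
    from   pos_day fact
    where  period_id between &BEG_ID and &END_ID
    group  by location_id, product_id;
begin
  for r in c1 loop
    -- first, try to update an existing aggregate row
    update pos_week
    set    sales_unit   = nvl(sales_unit,0)   + r.sales_unit,
           sales_retail = nvl(sales_retail,0) + r.sales_retail,
           gross_profit = nvl(gross_profit,0) + r.gross_profit
    where  period_id   = r.period_id
    and    location_id = r.location_id
    and    product_id  = r.product_id;

    -- if no row was updated, insert a new one
    if sql%rowcount = 0 then
      insert into pos_week
      values (r.period_id, r.location_id, r.product_id,
              r.sales_unit, r.sales_retail, r.gross_profit);
    end if;
  end loop;
  commit;
end;
/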

Prior to Oracle 9i, PL/SQL lacks one key programming construct that Pro-C provides: Dynamic SQL Method #2, prepare and execute. This programming technique can shave about 15–20% off data loading program runtimes, so in many cases it's worth the extra cost. How does this technique work? Remember, every time Oracle processes a command, it must parse, bind, execute, and fetch. With Dynamic SQL Method #2, you prepare a statement once outside of your loop processing and then execute it repeatedly inside the loop without Oracle having to reparse or rebind it. So, your Pro-C program would look like:

int main(int argc, char *argv[])
{
  ...
  EXEC SQL DECLARE C1 CURSOR FOR
    select /*+ parallel(fact,10) full(fact) */
           :WEEK_ID, location_id, product_id,
           sum(nvl(sales_unit,0)),
           sum(nvl(sales_retail,0)),
           sum(nvl(gross_profit,0))
    from pos_day fact
    where period_id between :BEG_ID and :END_ID
    group by location_id, product_id;

  /* Prepare the UPDATE and INSERT once, before the fetch loop,
     so they can be executed repeatedly without reparsing */
  strcpy(sql_command, "UPDATE pos_week \
    SET sales_unit   = nvl(sales_unit,0)   + :h_sales_unit:i_sales_unit, \
        sales_retail = nvl(sales_retail,0) + :h_sales_retail:i_sales_retail, \
        gross_profit = nvl(gross_profit,0) + :h_gross_profit:i_gross_profit \
    WHERE period_id   = :h_period_id:i_period_id \
      AND location_id = :h_location_id:i_location_id \
      AND product_id  = :h_product_id:i_product_id");
  EXEC SQL PREPARE update_command FROM :sql_command;

  strcpy(sql_command, "INSERT INTO pos_week VALUES \
    (:h_period_id:i_period_id, \
     :h_location_id:i_location_id, \
     :h_product_id:i_product_id, \
     :h_sales_unit:i_sales_unit, \
     :h_sales_retail:i_sales_retail, \
     :h_gross_profit:i_gross_profit)");
  EXEC SQL PREPARE insert_command FROM :sql_command;

  EXEC SQL OPEN C1;

  /* Process fact table records */
  EXEC SQL FETCH C1 INTO :h_period_id:i_period_id,
                         :h_location_id:i_location_id,
                         :h_product_id:i_product_id,
                         :h_sales_unit:i_sales_unit,
                         :h_sales_retail:i_sales_retail,
                         :h_gross_profit:i_gross_profit;

  while (sqlca.sqlcode == 0) {
    ...
    /* First - try to update the existing record; host variables are
       listed in the same order as the placeholders in the UPDATE text */
    EXEC SQL EXECUTE update_command
      USING :h_sales_unit:i_sales_unit,
            :h_sales_retail:i_sales_retail,
            :h_gross_profit:i_gross_profit,
            :h_period_id:i_period_id,
            :h_location_id:i_location_id,
            :h_product_id:i_product_id;

    /* Second - if the update fails because the record
       is not found, then insert the record */
    if (sqlca.sqlcode == 1403) {
      EXEC SQL EXECUTE insert_command
        USING :h_period_id:i_period_id,
              :h_location_id:i_location_id,
              :h_product_id:i_product_id,
              :h_sales_unit:i_sales_unit,
              :h_sales_retail:i_sales_retail,
              :h_gross_profit:i_gross_profit;
    }
    else if (sqlca.sqlcode != 0) {
      ...
    }
    ...
    EXEC SQL FETCH C1 INTO :h_period_id:i_period_id,
                           :h_location_id:i_location_id,
                           :h_product_id:i_product_id,
                           :h_sales_unit:i_sales_unit,
                           :h_sales_retail:i_sales_retail,
                           :h_gross_profit:i_gross_profit;
  }

  EXEC SQL CLOSE C1;
  ...
}

Of course, you'll probably have multiple target weeks that need to be aggregated. So, you could enclose the program call for the code above in a script that loops through those weeks and calls the program once per week. You should submit those invocations in the background such that some or even all of them could be running in parallel. Unlike Scenario 1, which runs both the insert and the select in parallel, Scenario 2 only runs the select command in parallel (and only for the execute stage); the fetched records are processed serially. That's why you should run this scenario with more parallel background invocations.

As before, the total parallel DML processes should equal:

  • No. of CPUs When CPU Bandwidth >= Disk I/O Bandwidth

  • 2–4 * No. of CPUs when Disk I/O Bandwidth >> CPU Bandwidth

Scenario 3

If you need upserts and you're using Oracle 9i, then you'll want to use the new MERGE command. It's simply a new DML command that encapsulates an UPDATE and an INSERT into a single command processed by a single call to the database. Your developers will love this new syntax, as it's exactly what they've been coding as separate, related DML commands with intelligent error handling. Now it is a single command and a single network request sent to the database server. So, you get the best of both worlds: easier to code and runs faster, too. Here's the code to perform our day-to-week aggregation:

alter session enable parallel dml;

merge /*+ parallel(aggr,10) */ into pos_week aggr
using (select /*+ parallel(pos_day,10) full(pos_day) */
              $WEEK_ID period_id, location_id, product_id,
              sum(nvl(sales_unit,0))   sales_unit,
              sum(nvl(sales_retail,0)) sales_retail,
              sum(nvl(gross_profit,0)) gross_profit
       from   pos_day
       where  period_id between $BEG_ID and $END_ID
       group  by $WEEK_ID, location_id, product_id) fact
on (fact.period_id   = aggr.period_id and
    fact.location_id = aggr.location_id and
    fact.product_id  = aggr.product_id)
when matched then
    update set sales_unit   = nvl(aggr.sales_unit,0)   + fact.sales_unit,
               sales_retail = nvl(aggr.sales_retail,0) + fact.sales_retail,
               gross_profit = nvl(aggr.gross_profit,0) + fact.gross_profit
when not matched then
    insert (period_id, location_id, product_id,
            sales_unit, sales_retail, gross_profit)
    values (fact.period_id, fact.location_id, fact.product_id,
            fact.sales_unit, fact.sales_retail, fact.gross_profit);

commit;

An interesting thought is that you may be able to capitalize on the efficiency of the new MERGE command to eke out a few more parallel processes. With MERGE, you can consider amending your total parallel DML processes to:

  • 1.5 * No. of CPUs When CPU Bandwidth >= Disk I/O Bandwidth

  • 2.5–4 * No. of CPUs when Disk I/O Bandwidth >> CPU Bandwidth

One final note: You cannot implement this technique with materialized views, as the refresh mechanism only supports queries (and not the MERGE command, at least not yet).


