Statistics gathered daily could contain both daily cycles and weekly trends. This can lead to chaotic-looking graphs when activity is plotted day by day. You can improve your graphs easily using SQL.
Look at Figure 4-1, which shows a chart of the raw figures for the number of page views for a web site per day, over the course of one year. These figures come from Webalizer, the web log analysis program (http://www.mrunix.net/webalizer). It is difficult to see the trends because the weekly cycle overwhelms the daily detail, and obscures the long-term trend.
Figure 4-1. Page views per day over one year
To understand the data, you need to separate the effect of the weekly cycle from the table. You can see the weekly cycle by taking the average for Monday, the average for Tuesday, and so forth. In Figure 4-2, Monday to Sunday are numbered 0 to 6.
Figure 4-2. Average page views by day of week
Notice that the value of the Sunday column (column 6) is less than half the value of the midweek columns. This is helping to cause the zigzag pattern in the original graph. If you view the data averaged per week (see Figure 4-3) rather than per day, it is easier to see the long-term trend.
Figure 4-3. Smoothed data page views averaged for one week
|
Before you can isolate these trends, you must turn the dates into integers to more easily put them into the appropriate buckets. You can pick an arbitrary date and start counting from there. Table 4-2 shows some source data.
whn | pages |
---|---|
2005-06-11 | 13368 |
2005-06-12 | 8129 |
2005-06-13 | 44043 |
... |
In Table 4-3, I've chosen the first day of the millennium, Monday, January 1, 2001, as day zero. Every date must be converted into the number of days since then. The mechanism for converting to integers is different on different engines. In MySQL, you can create this view using the TO_DAYS function:
CREATE VIEW webalizer2 AS SELECT TO_DAYS(whn)-TO_DAYS(DATE '2001-01-01') whn, pages FROM webalizer;
whn | pages |
---|---|
1622 | 13368 |
1623 | 8129 |
1624 | 44043 |
... |
With dates now represented by integers, you can perform arithmetic on them. Taking the modulus 7 value gives you the day of the week. Because 2001-01-01 was a Monday, you will get 0 on every seventh day from then. Tuesday will give you 1, Wednesday 2, and so on, with Sunday having the value 6.
4.2.1. Modular Arithmetic
Look at the values for whn%7 and FLOOR(whn/7). You can see that day number 1,622 (counting from 2001-01-01) is day number 5 of week number 231:
mysql> SELECT whn, whn%7, whn/7, FLOOR(whn/7) -> FROM webalizer2; +------+-------+----------+--------------+ | whn | whn%7 | whn/7 | FLOOR(whn/7) | +------+-------+----------+--------------+ | 1622 | 5 | 231.7143 | 231 | | 1623 | 6 | 231.8571 | 231 | | 1624 | 0 | 232.0000 | 232 | | 1625 | 1 | 232.1429 | 232 | | 1626 | 2 | 232.2857 | 232 | | 1627 | 3 | 232.4286 | 232 | | 1628 | 4 | 232.5714 | 232 | | 1629 | 5 | 232.7143 | 232 | | 1630 | 6 | 232.8571 | 232 | | 1631 | 0 | 233.0000 | 233 | | 1632 | 1 | 233.1429 | 233 | ...
You need to GROUP BY the whn%7 column to see the weekly cycle and GROUP BY the FLOOR(whn/7) column to see the trend.
To look at the intra-week pattern shown back in Figure 4-2, you take the average with GROUP BY whn%7:
mysql> SELECT whn%7, AVG(pages) -> FROM webalizer2 GROUP BY whn%7; +-------+------------+ | whn%7 | AVG(pages) | +-------+------------+ | 0 | 21391.6731 | | 1 | 23695.1538 | | 2 | 23026.2308 | | 3 | 24002.8077 | | 4 | 19773.9808 | | 5 | 10353.5472 | | 6 | 10173.9423 | +-------+------------+
To smooth out the data over the whole year, as shown in Figure 4-3, you can divide by 7 and take the integer value using the FLOOR function:
mysql> SELECT FLOOR(whn/7), AVG(pages) -> FROM webalizer2 GROUP BY FLOOR(whn/7); +--------------+------------+ | FLOOR(whn/7) | AVG(pages) | +--------------+------------+ | 231 | 10748.5000 | | 232 | 23987.8571 | | 233 | 19321.1429 | | 234 | 15347.0000 | ...
The value for the first week is artificially lowby chance, it includes two on only two days, and they are on weekends. Something similar might happen at the end of the interval, so it is safest to exclude any week that does not have seven entries. The HAVING clause will take care of that:
mysql> SELECT FLOOR(whn/7), AVG(pages) -> FROM webalizer2 GROUP BY FLOOR(whn/7) -> HAVING COUNT(*)=7; +--------------+------------+ | FLOOR(whn/7) | AVG(pages) | +--------------+------------+ | 232 | 23987.8571 | | 233 | 19321.1429 | | 234 | 15347.0000 | ...
This will work fine with MySQL and PostgreSQL, but you need to make a few alterations for SQL Server, Access, and Oracle.
4.2.2. SQL Server
Here's how to create the view that represents dates as integers:
CREATE VIEW webalizer2 AS SELECT CONVERT(INT,whn-'2001-01-01') whn, pages FROM webalizer
The SELECT statements shown earlier will run unmodified.
4.2.3. Access
In Access, you can use Int(whn - #2001-01-01#) to extract the number of days since January 1, 2001:
SELECT Int(whn - #2001-01-01#), pages FROM webalizer
Also, MOD is an infix operator used in place of %:
SELECT whn MOD 7, AVG(pages) FROM webalizer2 GROUP BY whn MOD 7;
4.2.4. Oracle
Here's how to create the view that represents dates as integers:
CREATE VIEW webalizer2 AS SELECT whn-DATE '2001-01-01' whn, pages FROM webalizer;
In Oracle, the module function is MOD, so you'd need to use that rather than whn%7:
SELECT MOD(whn,7), AVG(pages) FROM webalizer2 GROUP BY MOD(whn,7);
SQL Fundamentals
Joins, Unions, and Views
Text Handling
Date Handling
Number Crunching
Online Applications
Organizing Data
Storing Small Amounts of Data
Locking and Performance
Reporting
Users and Administration
Wider Access
Index