Hack 20. Uncover Trends in Your Data

Statistics gathered daily could contain both daily cycles and weekly trends. This can lead to chaotic-looking graphs when activity is plotted day by day. You can improve your graphs easily using SQL.

Look at Figure 4-1, which shows a chart of the raw figures for the number of page views for a web site per day, over the course of one year. These figures come from Webalizer, the web log analysis program (http://www.mrunix.net/webalizer). It is difficult to see the trends because the weekly cycle overwhelms the daily detail, and obscures the long-term trend.

Figure 4-1. Page views per day over one year

To understand the data, you need to separate the effect of the weekly cycle from the table. You can see the weekly cycle by taking the average for Monday, the average for Tuesday, and so forth. In Figure 4-2, Monday to Sunday are numbered 0 to 6.

Figure 4-2. Average page views by day of week

Notice that the value of the Sunday column (column 6) is less than half the value of the midweek columns. This is helping to cause the zigzag pattern in the original graph. If you view the data averaged per week (see Figure 4-3) rather than per day, it is easier to see the long-term trend.

Figure 4-3. Smoothed data page views averaged for one week

 The graphs and charts shown here come from Excel. Many spreadsheet applications, including Excel, have tools for importing directly from databases and producing a variety of graphical reports.

Before you can isolate these trends, you must turn the dates into integers to more easily put them into the appropriate buckets. You can pick an arbitrary date and start counting from there. Table 4-2 shows some source data.

Table 4-2. Page views by date

whn pages
2005-06-11 13368
2005-06-12 8129
2005-06-13 44043
...

In Table 4-3, I've chosen the first day of the millennium, Monday, January 1, 2001, as day zero. Every date must be converted into the number of days since then. The mechanism for converting to integers is different on different engines. In MySQL, you can create this view using the TO_DAYS function:

```CREATE VIEW webalizer2 AS
SELECT TO_DAYS(whn)-TO_DAYS(DATE '2001-01-01') whn, pages
FROM webalizer;```

Table 4-3. Converting dates to integers

whn pages
1622 13368
1623 8129
1624 44043
...

With dates now represented by integers, you can perform arithmetic on them. Taking the modulus 7 value gives you the day of the week. Because 2001-01-01 was a Monday, you will get 0 on every seventh day from then. Tuesday will give you 1, Wednesday 2, and so on, with Sunday having the value 6.

4.2.1. Modular Arithmetic

Look at the values for whn%7 and FLOOR(whn/7). You can see that day number 1,622 (counting from 2001-01-01) is day number 5 of week number 231:

```mysql> SELECT whn, whn%7, whn/7, FLOOR(whn/7)
-> FROM webalizer2;
+------+-------+----------+--------------+
| whn | whn%7 | whn/7 | FLOOR(whn/7) |
+------+-------+----------+--------------+
| 1622 | 5 | 231.7143 | 231 |
| 1623 | 6 | 231.8571 | 231 |
| 1624 | 0 | 232.0000 | 232 |
| 1625 | 1 | 232.1429 | 232 |
| 1626 | 2 | 232.2857 | 232 |
| 1627 | 3 | 232.4286 | 232 |
| 1628 | 4 | 232.5714 | 232 |
| 1629 | 5 | 232.7143 | 232 |
| 1630 | 6 | 232.8571 | 232 |
| 1631 | 0 | 233.0000 | 233 |
| 1632 | 1 | 233.1429 | 233 |
...```

You need to GROUP BY the whn%7 column to see the weekly cycle and GROUP BY the FLOOR(whn/7) column to see the trend.

To look at the intra-week pattern shown back in Figure 4-2, you take the average with GROUP BY whn%7:

```mysql> SELECT whn%7, AVG(pages)
-> FROM webalizer2 GROUP BY whn%7;
+-------+------------+
| whn%7 | AVG(pages) |
+-------+------------+
| 0 | 21391.6731 |
| 1 | 23695.1538 |
| 2 | 23026.2308 |
| 3 | 24002.8077 |
| 4 | 19773.9808 |
| 5 | 10353.5472 |
| 6 | 10173.9423 |
+-------+------------+```

To smooth out the data over the whole year, as shown in Figure 4-3, you can divide by 7 and take the integer value using the FLOOR function:

```mysql> SELECT FLOOR(whn/7), AVG(pages)
-> FROM webalizer2 GROUP BY FLOOR(whn/7);
+--------------+------------+
| FLOOR(whn/7) | AVG(pages) |
+--------------+------------+
| 231 | 10748.5000 |
| 232 | 23987.8571 |
| 233 | 19321.1429 |
| 234 | 15347.0000 |
...```

The value for the first week is artificially lowby chance, it includes two on only two days, and they are on weekends. Something similar might happen at the end of the interval, so it is safest to exclude any week that does not have seven entries. The HAVING clause will take care of that:

```mysql> SELECT FLOOR(whn/7), AVG(pages)
-> FROM webalizer2 GROUP BY FLOOR(whn/7)
-> HAVING COUNT(*)=7;
+--------------+------------+
| FLOOR(whn/7) | AVG(pages) |
+--------------+------------+
| 232 | 23987.8571 |
| 233 | 19321.1429 |
| 234 | 15347.0000 |
...```

This will work fine with MySQL and PostgreSQL, but you need to make a few alterations for SQL Server, Access, and Oracle.

4.2.2. SQL Server

Here's how to create the view that represents dates as integers:

```CREATE VIEW webalizer2 AS
SELECT CONVERT(INT,whn-'2001-01-01') whn, pages
FROM webalizer```

The SELECT statements shown earlier will run unmodified.

4.2.3. Access

In Access, you can use Int(whn - #2001-01-01#) to extract the number of days since January 1, 2001:

```SELECT Int(whn - #2001-01-01#), pages
FROM webalizer```

Also, MOD is an infix operator used in place of %:

```SELECT whn MOD 7, AVG(pages)
FROM webalizer2 GROUP BY whn MOD 7;```

4.2.4. Oracle

Here's how to create the view that represents dates as integers:

```CREATE VIEW webalizer2 AS
SELECT whn-DATE '2001-01-01' whn, pages
FROM webalizer;```

In Oracle, the module function is MOD, so you'd need to use that rather than whn%7:

```SELECT MOD(whn,7), AVG(pages)
FROM webalizer2 GROUP BY MOD(whn,7);```

SQL Hacks
ISBN: 0596527993
EAN: 2147483647
Year: 2004
Pages: 147
Simiral book on Amazon