Analyzing a Time Series


Analyzing a time series is one of the most complicated, but also one of the most interesting, topics. A time series is analyzed for many reasons, but what is a time series and what is it used for?

This section answers this question with the help of an example. In nearly all countries around the globe, people have to pay income tax. Information about people, such as their income and gender, is collected by the state. The data is used for many purposes, including computing the amount of taxes a person has to pay and generating statistics, which are often the bases for political decisions. The collected data is usually stored as a time series, and with the help of special algorithms, it is possible to get a lot of information out of the data. Retrieving statistical information out of a huge amount of data is called data mining.

Many books about data mining and algorithms have been written in the past, and it is impossible to provide a full coverage here. This section presents an overview of how a time series can be treated with PostgreSQL and stored efficiently in tables.

Let's create a table called timeseries and insert some data. The following creates a labor market table:

  CREATE TABLE timeseries(id serial, name text, start date, finish date, status text);  

The table consists of five columns. The first column is used as the unique identifier of a row. The second column contains the name of the person. Columns three and four define the period of time the person was either employed or unemployed (column five).

 timeseries=#  SELECT * FROM timeseries;  id   name      start       finish      status ----+---------+------------+------------+------------   1  Paul     1989-10-12  1999-08-03  employed   2  Paul     1999-08-04  2001-11-23  employed   3  Jana     1970-01-01  2002-08-12  employed   4  Jana     2002-08-12  2002-08-14  unemployed   5  Charles  1993-03-21  1995-10-01  unemployed   6  Charles  1995-10-02  2002-01-09  employed (6 rows) 

We will use this data to perform some operations.

Let's start with a simple example. We want to know how many people were employed on January 1, 2000:

  SELECT COUNT(*)   FROM timeseries   WHERE start<='2000-1-1'::date   AND finish>='2000-1-1'::date   AND status='employed';  

The database will return 3 , because all three persons in the database were employed. Things become a bit more complicated if we add one record to the database:

 7  Paul     2000-01-01  2000-01-04  employed 

Paul is employed by a second company and is therefore employed twice on January 1, 2000. It depends on what we want to retrieve from the database. If we use this query, the result is 4 . If we want to count persons instead of jobs, we use a different query:

  SELECT COUNT(*) INTO TEMPORARY TABLE temp_1   FROM timeseries   WHERE start<='2000-1-1'::date   AND finish>='2000-1-1'::date   AND status='employed'   GROUP BY name;  

In general, two comparatively easy solutions for the problem can be found, but both consist of two queries. The previous query selects data in a temporary table. Here's what can be found in this table:

 timeseries=#  SELECT * FROM temp_1;  count -------      1      1      2 (3 rows) 

The table contains three records that tell us how often a certain person is employed on January 1, 2001.

The same result can be achieved by defining a view using the same SELECT statement:

  CREATE VIEW unique_person AS SELECT COUNT(*)   FROM timeseries   WHERE start<='2000-1-1'::date   AND finish>='2000-1-1'::date   AND status='employed'   GROUP BY name;  

The desired result can now easily be calculated by counting the records in the table or in the view:

 timeseries=#  SELECT COUNT(*) FROM temp_1;  count -------      3 (1 row) 

The result is, as we expect, 3 .

This example shows how quickly simple problems can become very complex. The result of simple queries must always be questioned to make sure that the result is correct. Generating useful data is not a question of being a hardcore SQL programmer. Often it is much more important to know what the data contains and to make sure that you have taken every possible situation into consideration. The biggest problem with data mining is that in most cases people do not know which information has to be extracted from the data. Nearly all questions can be answered , as long as you know what you want to find out.

In the next example, we find all people who were employed after being unemployed. This query leads to an interesting result:

  SELECT a.name   FROM timeseries AS a, timeseries AS b   WHERE a.status='unemployed'   AND b.status='employed'   AND a.name=b.name;  

Two names are returned by the query:

 name ---------  Jana  Charles (2 rows) 

Jana has not been employed after being unemployed. She was employed and lost her job. Because we want to find people who find a job, this query is wrong.

Let's try the following query:

  SELECT a.name   FROM timeseries AS a, timeseries AS b   WHERE a.status='unemployed'   AND b.status='employed'   AND a.name=b.name   AND a.finish<b.start;  

We have added an additional condition to the query that makes sure that only Charles is selected. The person has to be unemployed before being employed; otherwise , the query returns a wrong result.

When executing the SQL command, we receive only one record:

 name ---------  Charles (1 row) 

To make sure that everything is right, we select Charles from the database:

 timeseries=#  SELECT * FROM timeseries WHERE name='Charles';  id   name      start       finish      status ----+---------+------------+------------+------------   5  Charles  1993-03-21  1995-10-01  unemployed   6  Charles  1995-10-02  2002-01-09  employed (2 rows) 

You can see, that his employment started exactly one day after his unemployment.

Assume that we want to find out the average number of days that people were employed in a certain year. We perform the query for the year 2002. The easiest way to solve the problem is to write a simple PL/pgSQL function:

 CREATE FUNCTION avg_days(date, date) RETURNS int4 AS '         DECLARE                 v_start date;                 v_end   date;                 result int4;         BEGIN                 result  := 0;                 v_start := ;                 v_end   := ;                 IF      (v_start < ''2002-1-1''::date) THEN                         v_start := ''2002-1-1''::date;                 END IF;                 IF      (v_end > ''2002-12-31''::date) THEN                         v_end := ''2002-12-31''::date;                 END IF;                 result := result + (v_end - v_start) + 1;                 RETURN result;         END; ' LANGUAGE 'plpgsql'; 

Normally, dates are not hard-coded in the function, because this would be far too inflexible . Nevertheless, we do some hard-coding here to make the function easier to understand.

Let's have a detailed look at this function. Two parameters have to be passed to the function. The first parameter is the field where the starting time of a period is defined. The second column contains the end of a certain period. After defining and declaring the required variables , we set the starting time of a record to January 1, 2002. The end also needs to be changed. This operation is necessary because we want to perform a subtraction; we want to know only how many days in 2002 a person was employed. Then the result is calculated for a certain record. Finally, we have to increment the result of the subtraction by 1 .

Note

If we have a period that lasts from January 1 to January 2nd, it is 2 days long. If we subtract January 2 from January 1, the result is 1 day. Therefore we have to add one day to the result.


Now we use the function described previously:

  SELECT id, avg_days(start, finish)   FROM timeseries   WHERE start<'2002-1-1'::date   AND finish>'2001-12-31'::date;  

All records that do not contain a valid interval have to be removed, which is done in the WHERE clause. Two records will be returned:

 id  avg_days ----+----------   3       224   6         9 (2 rows) 

Jana and Charles were employed in 2002. This query shows the number of days each of the two has been working. If we want to find the average number of days now, we have to add all days and divide it by the number of persons:

  SELECT SUM(avg_days(start, finish))/   (SELECT COUNT(*)   FROM timeseries   WHERE start<'2002-1-1'::date   AND finish>'2001-12-31'::date)   FROM timeseries   WHERE start<'2002-1-1'::date   AND finish>'2001-12-31'::date;  

In the previous example, we use a subquery to find out how many people are affected. The following shows that the query returns the right result:

 ?column? ----------------  116.5000000000 (1 row) 

If we look at the execution plan of the query, we can make a critical observation:

 timeseries=#  EXPLAIN SELECT SUM(avg_days(start, finish))/(SELECT COUNT(*) FROM   timeseries WHERE start<'2002-1-1'::date AND finish>'2001-12-31'::date) FROM   timeseries   WHERE start<'2002-1-1'::date AND finish>'2001-12-31'::date;  NOTICE:  QUERY PLAN: Aggregate  (cost=1.11..1.11 rows=1 width=8)   InitPlan     ->  Aggregate  (cost=1.11..1.11 rows=1 width=0)           ->  Seq Scan on timeseries  (cost=0.00..1.10 rows=1 width=0)   ->  Seq Scan on timeseries  (cost=0.00..1.10 rows=1 width=8) EXPLAIN 

The database has to perform a sequential scan on timeseries twice, so the query is extremely slow. In real-world applications, the problem is avoided by performing all major calculations on the application level. Databases are very fast for searching, but are in most cases not suitable for performing complex operations. If the amount of data that has to be processed becomes really big, you can run into trouble very soon ”processing a query that has to do two full table scans will be much too slow.

Another important piece of information is the average time a person has worked in a certain job:

 timeseries=#  SELECT AVG(finish-start) FROM timeseries WHERE status='employed';  avg -----------------  3725.8000000000 (1 row) 

The SQL code is simple. We just calculate the difference between the beginning and the end and compute the average value. The difference is displayed in days.

PostgreSQL is as good for analyzing a time series as any other database available. Generally, databases are extremely fast for searching and retrieving data, but not for performing complex operations and calculations. Often, things have to be done on the application level to achieve reasonable performance or to keep the code simple. SQL is a language designed to select data and not to do exception handling as it is done with an ordinary programming language. If you need some sort of exception handling, you can use PL/pgSQL; but if you want to build really big functions, PL/pgSQL can be troublesome because it is sometimes hard to debug.



PostgreSQL Developer's Handbook2001
PostgreSQL Developer's Handbook2001
ISBN: N/A
EAN: N/A
Year: 2004
Pages: 125

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net