Understanding Statistical Functions | Beginning Transact-SQL With SQL Server 2000 and 2005

Statistics wasn't my best subject in school. Although I understood the relevance and importance of statistics, my brain just wasn't wired for it. Fortunately, now that I use these functions regularly in consulting and application development work, I no longer struggle with it, but I often need to jog my memory by looking at an example. This section explains these concepts in simple terms and provides some useful examples for those of us who don't think statistically.

The VAR() Function

This function returns the statistical variance for a range of values, that is, a value that indicates how "spread out" the values are in the range. The value returned by this function is actually the measure of how far the extreme low range or high range value is from the middle — or mean value of the range, weighted by the greatest concentration of similar values. For example, given the range of values on a number line, 2, 3, 4, 5, and 6, the number 4 is the mean — it's in the middle of the range. In this simple example, the variance of this range is 2 (from 4 to 2 and from 4 to 6 both have a difference of 2). This is very simple if you have a list of distinct, incremental values but it gets a little more complex as the values are less uniform.

Try It Out

You can do some simple experimenting with values in a single-column table created by running the following query:

 Create Table MyValues (MyValue Float)

Now, insert the values given in the previous example, using this query:

 Insert Into MyValues (MyValue) SELECT 2  Insert Into MyValues (MyValue) SELECT 3  Insert Into MyValues (MyValue) SELECT 4  Insert Into MyValues (MyValue) SELECT 5  Insert Into MyValues (MyValue) SELECT 6

To return the variance of this range, use this query:

 SELECT VAR(MyValue) FROM MyValues

If you insert more values close to the center of the range, you will see that this changes the outcome. This is because the result is computed as the average squared deviation (difference) of each number from its mean. This is done so negative numbers behave appropriately and to weight the equation toward the center of the greatest concentration of values. Regardless, it's a standard statistical function and, fortunately, you probably don't need to concern yourself with the specifics of the internal calculation. Calculating variance is the first step in performing other statistical functions, such as standard deviation.

As you can see, using integer values to keep things simple, you've created a bell-curve around the mean value, 4:

 INSERT INTO MyValues (MyValue) SELECT 3  INSERT INTO MyValues (MyValue) SELECT 4  INSERT INTO MyValues (MyValue) SELECT 4  INSERT INTO MyValues (MyValue) SELECT 4  INSERT INTO MyValues (MyValue) SELECT 5

You then return the deviation for the range again:

 SELECT VAR(MyValue) FROM MyValues

This reduces the value of the standard deviation to indicate that values, on average, are less spread out.

The VARP() Function

The variance over a population is simply another indicator of this same principle, using a different formula. This formula is sometimes called biased estimate of variance. Although this method is used in some complex calculations, the other form of variance is more common.

The STDEV() Function

Have you ever taken a class where the teacher graded on a curve? If so, you were the victim of standard deviation (or, perhaps, the benefactor).

The standard deviation is a calculation based on the variance of a numeric range of values. Actually, it's simply the square root of the variance. In a normal distribution, values can be plotted in a bell-curve, the mean value represented by the center of the curve. If you were to slice off the center of the curve, taking about 68% of the most common values, this would represent the standard deviation (or "first standard deviation"). If you were to move outward the same variation of values, you would take off another 27% (a total of 95%), leaving only 5%.

Standard deviation is an effective method for analyzing and making sense of large distributions of data. It is also a common method to calculate risk and probability.

To measure the standard deviation for your sample values table, simply use the following query:

 SELECT STDEV(MyValue) FROM MyValues

The result, 1.1547. . . , tells you that for the values in your table, those values that are in the range from 2.8453 to 5.1547 (within 1.1547 of the mean) are in the first standard deviation.

Using the AdventureWorks sales data, you can apply this analysis to bicycle sales. Suppose that the director of marketing asks "How much did most of our customers pay for road bikes in 2003?" Just modify the query you used before, using the STDEV() function like this:

 SELECT   STDEV(UnitPrice)  FROM     SalesOrderHeader  INNER JOIN SalesOrderDetail ON  SalesOrderHeader.SalesOrderID =SalesOrderDetail.SalesOrderID  INNER JOIN Product ON SalesOrderDetail.ProductID = Product.ProductID WHERE    Product.ProductSubCategoryID = 2  AND SalesOrderHeader.OrderDate BETWEEN ‘1-1-

The result is 636.54. This means that most of your customers paid between $479.50 and $ 1,752.58 for their road bikes — at least those purchases in the first standard deviation.

The STDDEVP() Function

This function calculates standard deviation based on the variance of a population.

User-Defined Aggregate Functions

SQL Server 2005 allows application developers to add custom aggregate functions to a database. These functions are written in a .NET programming language, such as C# or Visual Basic.NET, and must be compiled into a .NET assembly using the Microsoft .NET Common Language Runtime. As a SQL query designer, all you need to know is that once deployed and correctly configured, you can use these functions in your queries as you would any of the system-supplied aggregate functions.