Group Functions and the GROUP BY Clause

This section explains how you can group rows together and perform some kind of aggregate operation on them. For example, you may want to count the rows for a given condition, calculate averages of numeric columns, or find the highest or lowest value for a given column in a query result.

aggregate

A type of function in Oracle SQL that performs a calculation or transformation across multiple rows in a table, rather than just on a single row.

The GROUP BY clause fits into the SELECT statement as follows:

SELECT * | {[DISTINCT] column | expression [alias]          | group_function(column), ...}   FROM tablename   [WHERE condition ... ]   [GROUP BY group_expression, group_expression ...]   [ORDER BY column [ASC | DESC], column [ASC | DESC], ...];

All group functions ignore NULLs by default. If you wanted to calculate the average commission across employees, you would most likely not want to consider employees who are not in the sales area (and therefore have a NULL commission value). On the other hand, you might want to treat NULL values numerically in other situations. You will see later in this chapter how you can convert NULL values with the NVL function.

Group Functions

Table 4.2 lists some of the most commonly used group functions in SQL statements. The COUNT function is the only aggregate function that will count rows with NULL values in any column when * is used as an argument.

Table 4.2: Common Group Functions
Function	Description
COUNT	Counts the number of rows, either all rows or for non-NULL column values
AVG	Calculates the average value of a column
SUM	Returns the sum of values for a column
MIN	Returns the minimum value for all column values
MAX	Returns the maximum value for all column values
STDDEV	Calculates the standard deviation for a specified column

All of the functions listed in Table 4.2 have a calling sequence as follows:

 function([DISTINCT | ALL] expression)

As mentioned earlier, the COUNT function allows for * as its only argument, to specify that rows are to be counted, whether or not they have NULL values. The COUNT, MIN, and MAX functions apply to date and string expressions in addition to numeric expressions; the rest must have numeric arguments only.

The DISTINCT keyword indicates that duplicates are to be removed before the aggregate calculation is done. For example, calculating AVG(SALARY) versus AVG(DISTINCT SALARY) would be quite different if most of the employees are at one end of the pay scale. ALL is the default.

The boss, King, wants to get more information on salary distribution by department, so he asks Janice to give him the count of employees and the average salary and commission for his department, which has a department ID of 90. Janice runs the following query:

select count(*), avg(salary),   avg(commission_pct) from employees   where department_id = 90;   COUNT(*) AVG(SALARY) AVG(COMMISSION_PCT) ---------- ----------- -------------------          3  19333.3333 1 row selected.

Notice that the average commission in this case is not zero, but NULL; there were no employees in department 90 with a commission. The result would have been non-NULL, if there were at least one employee who worked on a commission for part of their salary.

The next morning, the boss asks the same question for department 80, which has the bulk of the commissioned employees. Janice gets the answer with this query:

select count(*), avg(salary),   avg(commission_pct) from employees   where department_id = 80;   COUNT(*) AVG(SALARY) AVG(COMMISSION_PCT) ---------- ----------- -------------------         34  8955.88235                .225 1 row selected.

Janice hears rumors that King is going to ask for a breakdown of the number of employees, how many are on commission, and how many distinct commission percentages there are. She comes up with this query:

select count(*), count(commission_pct) "Comm Count",   count(distinct commission_pct) "Distinct Comm"   from employees;   COUNT(*) Comm Count Distinct Comm ---------- ---------- -------------        107         35             7 1 row selected.

What does this tell King? The total number of employees is 107, regardless of whether there are any NULL values in any of the columns. Of those employees, 35 are on commission (have a non-NULL value for COMMISSION_PCT), and out of those 35, there are seven different commission levels in force at the company.

Janice also suspects that King will be asking for some statistics for other departments. Rather than run the same query for different department numbers, she decides that it might be worthwhile to use the GROUP BY function to give King all the information he needs in a single query.

The GROUP BY Clause

The GROUP BY clause is used to break down the results of a query based on a column or columns. Once the rows are subdivided into groups, the aggregate functions described earlier in this chapter can be applied to these groups. Note the following rules about using the GROUP BY clause:

All columns in a SELECT statement that are not in the GROUP BY clause must be part of an aggregate function.
The WHERE clause can be used to filter rows from the result before the grouping functions are applied.
The GROUP BY clause also specifies the sort order; this can be overridden with an ORDER BY clause.
Column aliases cannot be used in the GROUP BY clause.

Janice has been busy preparing a report for King that will break down the salary and commission information by department. Her first query looks like this:

select department_id "Dept", count(*), avg(salary),   avg(commission_pct) from employees   group by department_id;  Dept   COUNT(*) AVG(SALARY) AVG(COMMISSION_PCT) ----- ---------- ----------- -------------------    10          1        4400    20          2        9500    30          6        4150    40          1        6500    50         45  3475.55556    60          5        5760    70          1       10000    80         34  8955.88235                .225    90          3  19333.3333   100          6        8600   110          2       10150                1        7000                 .15 12 rows selected.

This gives King a breakdown, by department, of the employee count, the average salary, and the average commission. NULLs are not included in the calculation for commission or salary. King likes this report, but Janice suspects that he will be asking for something different tomorrow.

One of the departments has a NULL value. There is one employee who has not yet been assigned to a department, but this employee does have a salary and a commission.

As expected, King calls the next day with another request. He wants to see how the salaries and commissions break out within department by job function. Janice realizes that all she needs to do is to add the job ID to the query in both the SELECT clause and the GROUP BY clause:

select department_id "Dept", job_id "Job", count(*),   avg(salary), avg(commission_pct) from employees   group by department_id, job_id;  Dept Job          COUNT(*) AVG(SALARY) AVG(COMMISSION_PCT) ----- ---------- ---------- ----------- -------------------       SA_REP              1        7000                 .15    10 AD_ASST             1        4400    20 MK_MAN              1       13000    20 MK_REP              1        6000    30 PU_MAN              1       11000    30 PU_CLERK            5        2780    40 HR_REP              1        6500    50 ST_MAN              5        7280    50 SH_CLERK           20        3215    50 ST_CLERK           20        2785    60 IT_PROG             5        5760    70 PR_REP              1       10000    80 SA_MAN              5       12200                  .3    80 SA_REP             29  8396.55172          .212068966    90 AD_VP               2       17000    90 AD_PRES             1       24000   100 FI_MGR              1       12000   100 FI_ACCOUNT          5        7920   110 AC_MGR              1       12000   110 AC_ACCOUNT          1        8300 20 rows selected.

As a side benefit, this also gives King the breakdown of jobs within each department.

Using NVL with Group Functions

As mentioned earlier in this chapter, group functions will ignore NULL values in their calculations. In most cases, this makes a lot of sense. For example, if only a small handful of employees worked on commission, and you calculated the average commission with the assumption that a NULL commission was essentially a zero commission, then the average commission would be quite low!

How you should interpret NULL values in a column depends on the business rules of the company and what NULL values represent. An average commission is usually based on only those employees who work on commission, and, in this case, the default behavior of Oracle’s grouping functions makes sense.

However, there may be times when it makes sense to convert NULL values to something that can be aggregated. Let’s assume for the moment that there is a column called COMMISSION_AMT in the EMPLOYEES table that records the latest monthly commission received by that employee. Just as with the COMMISSION_PCT column, the COMMISSION_AMT field is NULL for all employees except those in the Sales department. If King wanted a report of the average salary and commission (if any) by department, the expression

avg(salary + commission_amt)

in the SELECT clause would give results for only those rows with non-NULL commissions. That would not be what King was looking for. Janice would need to essentially convert any NULL values to zero. This is what NVL will do, and the expression above can be rewritten as:

avg(salary + nvl(commission_amt,0))

For each row, if the COMMISSION_AMT is NULL, it is converted to zero (or any other amount you want) and added to SALARY, and the average is returned after all rows have been read.

The HAVING Clause

The HAVING clause is analogous to the WHERE clause, except that the HAVING clause applies to aggregate functions instead of individual columns or single-row function results. A query with a HAVING clause still returns aggregate values, but those aggregated summary rows are filtered from the query output based on the conditions in the HAVING clause.

The HAVING clause fits into the SELECT syntax as follows:

SELECT * | {[DISTINCT] column | expression [alias]   | group_function(column), ...}   FROM tablename   [WHERE condition ... ]   [GROUP BY group_expression, group_expression ...]   [HAVING group_condition, ...]   [ORDER BY column [ASC | DESC], column [ASC | DESC], ...];

The queries that Janice wrote for King have the information he needs, but his time is limited and he only wants to see the breakdowns for the department and job combinations that have average salaries over $10,000. Janice takes the original query

select department_id "Dept", job_id "Job", count(*),   avg(salary), avg(commission_pct) from employees group by department_id, job_id;

and adds a HAVING clause that removes the lower average salaries:

select department_id "Dept", job_id "Job", count(*),   avg(salary), avg(commission_pct) from employees group by department_id, job_id having avg(salary) > 10000;  Dept Job          COUNT(*) AVG(SALARY) AVG(COMMISSION_PCT) ----- ---------- ---------- ----------- -------------------    20 MK_MAN              1       13000    30 PU_MAN              1       11000    80 SA_MAN              5       12200                  .3    90 AD_VP               2       17000    90 AD_PRES             1       24000   100 FI_MGR              1       12000   110 AC_MGR              1       12000 7 rows selected.

Janice becomes proactive again, and she anticipates that King will want to see only certain jobs in the report. She can easily add a WHERE clause to select only administrative and sales positions. She uses the LIKE clause to select these job functions:

select department_id "Dept", job_id "Job", count(*),   avg(salary), avg(commission_pct) from employees where job_id like ‘AD%’ or job_id like ‘SA%’ group by department_id, job_id having avg(salary) > 10000;  Dept Job          COUNT(*) AVG(SALARY) AVG(COMMISSION_PCT) ----- ---------- ---------- ----------- -------------------    80 SA_MAN              5       12200                  .3    90 AD_VP               2       17000    90 AD_PRES             1       24000 3 rows selected.

The order of the WHERE, GROUP, and HAVING clauses does not change how the query is run or the results; however, the ordering shown here is indicative of how the SQL engine processes the command. If an ORDER BY clause was needed in the above query, it could be placed anywhere after the SELECT clause, but would most logically belong at the end of the query.