Recipe12.8.Creating a Predefined Number of Buckets | SQL Cookbook (Cookbooks (OReilly))

Recipe 12.8. Creating a Predefined Number of Buckets

Problem

You want to organize your data into a fixed number of buckets. For example, you want to organize the employees in table EMP into four buckets. The result set should look similar to the following:

 GRP EMPNO ENAME --- ----- ---------   1  7369 SMITH   1  7499 ALLEN   1  7521 WARD   1  7566 JONES   2  7654 MARTIN   2  7698 BLAKE   2  7782 CLARK   2  7788 SCOTT   3  7839 KING   3  7844 TURNER   3  7876 ADAMS   4  7900 JAMES   4  7902 FORD   4  7934 MILLER

This problem is the opposite of the previous recipe, where you had an unknown number of buckets but a predetermined number of elements in each bucket. In this recipe, the goal is such that you may not necessarily know how many elements are in each bucket, but you are defining a fixed (known) number of buckets to be created.

Solution

The solution to this problem is trivial if your RDBMS provides functions for creating "buckets" of rows. If your RDBMS provides no such functions, you can simply rank each row, and then use the modulus of said rank and n, where n is the number of buckets you wish to create, in an expression to determine into which bucket the row falls. Where available, this solution will make use of the NTILE window function for creating a fixed number of buckets. NTILE organizes an ordered set into the number of buckets you specify, with any stragglers distributed into the available buckets starting from the first bucket. The desired result set for this recipe reflects this: buckets 1 and 2 have four rows while buckets 3 and 4 have three rows. If your RDBMS does not support NTILE, don't worry about which rows are in which buckets; the main goal of this recipe is to create the fixed number of buckets you are requesting.

DB2

Use the window function ROW_NUMBER OVER window function to rank the rows by EMPNO, then use the modulus of the rank and 4 to create four buckets:

 1 select mod(row_number( )over(order by empno),4)+1 grp, 2        empno, 3        ename 4   from emp 5  order by 1

Oracle and SQL Server

The DB2 solution will work for these vendors but alternatively (conveniently) you may use the NTILE window function to create four buckets:

 1 select ntile(4)over(order by empno) grp, 2        empno, 3        ename 4   from emp

MySQL, and PostgreSQL

Use a self join to rank the rows by EMPNO, then use the modulus of the rank and 4 to create your buckets:

 1 select mod(count(*),4)+1 as grp, 2        e.empno, 3        e.ename 4   from emp e, emp d 5  where e.empno >= d.empno 6  group by e.empno,e.ename 7  order by 1

Discussion

DB2

The first step is to use the window function ROW_NUMBER OVER to rank each row by EMPNO:

  select row_number( )over(order by empno) grp,        empno,        ename   from emp GRP EMPNO ENAME --- ----- ------   1  7369 SMITH   2  7499 ALLEN   3  7521 WARD   4  7566 JONES   5  7654 MARTIN   6  7698 BLAKE   7  7782 CLARK   8  7788 SCOTT   9  7839 KING  10  7844 TURNER  11  7876 ADAMS  12  7900 JAMES  13  7902 FORD  14  7934 MILLER

Now that the rows are ranked, use the modulo function, MOD, to create four buckets:

  select mod(row_number( )over(order by empno),4) grp,        empno,        ename   from emp GRP EMPNO ENAME --- ----- ------   1  7369 SMITH   2  7499 ALLEN   3  7521 WARD   0  7566 JONES   1  7654 MARTIN   2  7698 BLAKE   3  7782 CLARK   0  7788 SCOTT   1  7839 KING   2  7844 TURNER   3  7876 ADAMS   0  7900 JAMES   1  7902 FORD   2  7934 MILLER

The last step is to add one GRP so the buckets start at 1, not 0, and use ORDER BY on GRP to order the rows by bucket.

Oracle and SQL Server

All the work is done by the NTILE function. Simply pass it a number representing the number of buckets you want, and watch the magic unfold right in front of your eyes.

MySQL and PostgreSQL

The fist step is to generate a Cartesian product with table EMP so that each EMPNO can be compared with every other EMPNO [only a snippet of the Cartesian is shown below because there would be 196 rows returned (14x14)]:

  select e.empno,         e.ename,         d.empno,         d.ename   from emp e, emp d EMPNO ENAME           EMPNO ENAME ----- ---------- ---------- ---------  7369 SMITH            7369 SMITH  7369 SMITH            7499 ALLEN  7369 SMITH            7521 WARD  7369 SMITH            7566 JONES  7369 SMITH            7654 MARTIN  7369 SMITH            7698 BLAKE  7369 SMITH            7782 CLARK  7369 SMITH            7788 SCOTT  7369 SMITH            7839 KING  7369 SMITH            7844 TURNER  7369 SMITH            7876 ADAMS  7369 SMITH            7900 JAMES  7369 SMITH            7902 FORD  7369 SMITH            7934 MILLER  …

As you can see from this result set, you can compare SMITH's EMPNO to the EMPNO of all the other employees in EMP (you can compare each employee's EMPNO with all the other employees' EMPNOs). The next step is to restrict the Cartesian product to only those EMPNOs that are greater than or equal to another EMPNO. A portion of the result set (as there are 105 rows) is shown below:

  select e.empno,        e.ename,        d.empno,        d.ename   from emp e, emp d  where e.empno >= d.empno EMPNO ENAME           EMPNO ENAME ----- ---------- ---------- ----------  7934 MILLER           7934 MILLER  7934 MILLER           7902 FORD  7934 MILLER           7900 JAMES  7934 MILLER           7876 ADAMS  7934 MILLER           7844 TURNER  7934 MILLER           7839 KING  7934 MILLER           7788 SCOTT  7934 MILLER           7782 CLARK  7934 MILLER           7698 BLAKE  7934 MILLER           7654 MARTIN  7934 MILLER           7566 JONES  7934 MILLER           7521 WARD  7934 MILLER           7499 ALLEN  7934 MILLER           7369 SMITH  …  7499 ALLEN            7499 ALLEN  7499 ALLEN            7369 SMITH  7369 SMITH            7369 SMITH

Of the entire result set, I've included only rows (from EMP E) for MILLER, ALLEN, and SMITH in this output. The reason is to show you how the Cartesian product has been restricted by the WHERE clause. Because the filter on EMPNO in the WHERE clause uses "greater than or equal to," you know you will get at least one row for each employee because each EMPNO is equal to itself. But why is there only one row for SMITH (on the left-hand side of the result set) when there are two rows for ALLEN and 14 rows for MILLER? The reason is the compound evaluation on EMPNO in the WHERE clause: "greater than or equal to". In SMITH's case, there is no EMPNO that 7369 is greater than, so only one row is returned for SMITH. In ALLEN's case, ALLEN's EMPNO is obviously equal to itself (so that row is returned), but 7499 is also greater than 7369 (SMITH's EMPNO) so two rows are returned for ALLEN. In the case of MILLER's EMPNO 7934, it is greater than all the other EMPNOs in table EMP (and obviously equal to itself ) so there are 14 MILLER rows returned.

Now you can compare each EMPNO and determine which ones are greater than others. Use the aggregate function COUNT to return the self join as a more expressive result set:

  select count(*) as grp,        e.empno,        e.ename   from emp e, emp d  where e.empno >= d.empno  group by e.empno,e.ename  order by 1 GRP      EMPNO ENAME --- ---------- ----------  1        7369 SMITH  2        7499 ALLEN  3        7521 WARD  4        7566 JONES  5        7654 MARTIN  6        7698 BLAKE  7        7782 CLARK  8        7788 SCOTT  9        7839 KING 10        7844 TURNER 11        7876 ADAMS 12        7900 JAMES 13        7902 FORD 14        7934 MILLER

Now that the rows are ranked, simply add 1 to the modulus of GRP and 4 to create four buckets (adding 1 so the buckets start at 1, not 0). Use the ORDER BY clause on GRP to order the buckets appropriately:

  select mod(count(*),4)+1 as grp,        e.empno,        e.ename   from emp e, emp d  where e.empno >= d.empno  group by e.empno,e.ename  order by 1 GRP      EMPNO ENAME --- ---------- ---------   1       7900 JAMES   1       7566 JONES   1       7788 SCOTT   2       7369 SMITH   2       7902 FORD   2       7654 MARTIN   2       7839 KING   3       7499 ALLEN   3       7698 BLAKE   3       7934 MILLER   3       7844 TURNER   4       7521 WARD   4       7782 CLARK   4       7876 ADAMS