Recipe 12.8. Creating a Predefined Number of BucketsProblemYou want to organize your data into a fixed number of buckets. For example, you want to organize the employees in table EMP into four buckets. The result set should look similar to the following: GRP EMPNO ENAME --- ----- --------- 1 7369 SMITH 1 7499 ALLEN 1 7521 WARD 1 7566 JONES 2 7654 MARTIN 2 7698 BLAKE 2 7782 CLARK 2 7788 SCOTT 3 7839 KING 3 7844 TURNER 3 7876 ADAMS 4 7900 JAMES 4 7902 FORD 4 7934 MILLER This problem is the opposite of the previous recipe, where you had an unknown number of buckets but a predetermined number of elements in each bucket. In this recipe, the goal is such that you may not necessarily know how many elements are in each bucket, but you are defining a fixed (known) number of buckets to be created. SolutionThe solution to this problem is trivial if your RDBMS provides functions for creating "buckets" of rows. If your RDBMS provides no such functions, you can simply rank each row, and then use the modulus of said rank and n, where n is the number of buckets you wish to create, in an expression to determine into which bucket the row falls. Where available, this solution will make use of the NTILE window function for creating a fixed number of buckets. NTILE organizes an ordered set into the number of buckets you specify, with any stragglers distributed into the available buckets starting from the first bucket. The desired result set for this recipe reflects this: buckets 1 and 2 have four rows while buckets 3 and 4 have three rows. If your RDBMS does not support NTILE, don't worry about which rows are in which buckets; the main goal of this recipe is to create the fixed number of buckets you are requesting. DB2Use the window function ROW_NUMBER OVER window function to rank the rows by EMPNO, then use the modulus of the rank and 4 to create four buckets: 1 select mod(row_number( )over(order by empno),4)+1 grp, 2 empno, 3 ename 4 from emp 5 order by 1 Oracle and SQL ServerThe DB2 solution will work for these vendors but alternatively (conveniently) you may use the NTILE window function to create four buckets: 1 select ntile(4)over(order by empno) grp, 2 empno, 3 ename 4 from emp MySQL, and PostgreSQLUse a self join to rank the rows by EMPNO, then use the modulus of the rank and 4 to create your buckets: 1 select mod(count(*),4)+1 as grp, 2 e.empno, 3 e.ename 4 from emp e, emp d 5 where e.empno >= d.empno 6 group by e.empno,e.ename 7 order by 1 DiscussionDB2The first step is to use the window function ROW_NUMBER OVER to rank each row by EMPNO: select row_number( )over(order by empno) grp, empno, ename from emp GRP EMPNO ENAME --- ----- ------ 1 7369 SMITH 2 7499 ALLEN 3 7521 WARD 4 7566 JONES 5 7654 MARTIN 6 7698 BLAKE 7 7782 CLARK 8 7788 SCOTT 9 7839 KING 10 7844 TURNER 11 7876 ADAMS 12 7900 JAMES 13 7902 FORD 14 7934 MILLER Now that the rows are ranked, use the modulo function, MOD, to create four buckets: select mod(row_number( )over(order by empno),4) grp, empno, ename from emp GRP EMPNO ENAME --- ----- ------ 1 7369 SMITH 2 7499 ALLEN 3 7521 WARD 0 7566 JONES 1 7654 MARTIN 2 7698 BLAKE 3 7782 CLARK 0 7788 SCOTT 1 7839 KING 2 7844 TURNER 3 7876 ADAMS 0 7900 JAMES 1 7902 FORD 2 7934 MILLER The last step is to add one GRP so the buckets start at 1, not 0, and use ORDER BY on GRP to order the rows by bucket. Oracle and SQL ServerAll the work is done by the NTILE function. Simply pass it a number representing the number of buckets you want, and watch the magic unfold right in front of your eyes. MySQL and PostgreSQLThe fist step is to generate a Cartesian product with table EMP so that each EMPNO can be compared with every other EMPNO [only a snippet of the Cartesian is shown below because there would be 196 rows returned (14x14)]: select e.empno, e.ename, d.empno, d.ename from emp e, emp d EMPNO ENAME EMPNO ENAME ----- ---------- ---------- --------- 7369 SMITH 7369 SMITH 7369 SMITH 7499 ALLEN 7369 SMITH 7521 WARD 7369 SMITH 7566 JONES 7369 SMITH 7654 MARTIN 7369 SMITH 7698 BLAKE 7369 SMITH 7782 CLARK 7369 SMITH 7788 SCOTT 7369 SMITH 7839 KING 7369 SMITH 7844 TURNER 7369 SMITH 7876 ADAMS 7369 SMITH 7900 JAMES 7369 SMITH 7902 FORD 7369 SMITH 7934 MILLER … As you can see from this result set, you can compare SMITH's EMPNO to the EMPNO of all the other employees in EMP (you can compare each employee's EMPNO with all the other employees' EMPNOs). The next step is to restrict the Cartesian product to only those EMPNOs that are greater than or equal to another EMPNO. A portion of the result set (as there are 105 rows) is shown below: select e.empno, e.ename, d.empno, d.ename from emp e, emp d where e.empno >= d.empno EMPNO ENAME EMPNO ENAME ----- ---------- ---------- ---------- 7934 MILLER 7934 MILLER 7934 MILLER 7902 FORD 7934 MILLER 7900 JAMES 7934 MILLER 7876 ADAMS 7934 MILLER 7844 TURNER 7934 MILLER 7839 KING 7934 MILLER 7788 SCOTT 7934 MILLER 7782 CLARK 7934 MILLER 7698 BLAKE 7934 MILLER 7654 MARTIN 7934 MILLER 7566 JONES 7934 MILLER 7521 WARD 7934 MILLER 7499 ALLEN 7934 MILLER 7369 SMITH … 7499 ALLEN 7499 ALLEN 7499 ALLEN 7369 SMITH 7369 SMITH 7369 SMITH Of the entire result set, I've included only rows (from EMP E) for MILLER, ALLEN, and SMITH in this output. The reason is to show you how the Cartesian product has been restricted by the WHERE clause. Because the filter on EMPNO in the WHERE clause uses "greater than or equal to," you know you will get at least one row for each employee because each EMPNO is equal to itself. But why is there only one row for SMITH (on the left-hand side of the result set) when there are two rows for ALLEN and 14 rows for MILLER? The reason is the compound evaluation on EMPNO in the WHERE clause: "greater than or equal to". In SMITH's case, there is no EMPNO that 7369 is greater than, so only one row is returned for SMITH. In ALLEN's case, ALLEN's EMPNO is obviously equal to itself (so that row is returned), but 7499 is also greater than 7369 (SMITH's EMPNO) so two rows are returned for ALLEN. In the case of MILLER's EMPNO 7934, it is greater than all the other EMPNOs in table EMP (and obviously equal to itself ) so there are 14 MILLER rows returned. Now you can compare each EMPNO and determine which ones are greater than others. Use the aggregate function COUNT to return the self join as a more expressive result set: select count(*) as grp, e.empno, e.ename from emp e, emp d where e.empno >= d.empno group by e.empno,e.ename order by 1 GRP EMPNO ENAME --- ---------- ---------- 1 7369 SMITH 2 7499 ALLEN 3 7521 WARD 4 7566 JONES 5 7654 MARTIN 6 7698 BLAKE 7 7782 CLARK 8 7788 SCOTT 9 7839 KING 10 7844 TURNER 11 7876 ADAMS 12 7900 JAMES 13 7902 FORD 14 7934 MILLER Now that the rows are ranked, simply add 1 to the modulus of GRP and 4 to create four buckets (adding 1 so the buckets start at 1, not 0). Use the ORDER BY clause on GRP to order the buckets appropriately: select mod(count(*),4)+1 as grp, e.empno, e.ename from emp e, emp d where e.empno >= d.empno group by e.empno,e.ename order by 1 GRP EMPNO ENAME --- ---------- --------- 1 7900 JAMES 1 7566 JONES 1 7788 SCOTT 2 7369 SMITH 2 7902 FORD 2 7654 MARTIN 2 7839 KING 3 7499 ALLEN 3 7698 BLAKE 3 7934 MILLER 3 7844 TURNER 4 7521 WARD 4 7782 CLARK 4 7876 ADAMS |