Section 7.4. Following Statistical Patterns

7.4. Following Statistical Patterns

Suppose that you are building a table of customers and populating it with names that will be used later in searches. How would you generate the values? For example, should the values be random collections of characters like the following?

     -32nr -32nr3121ne -e21e     323-=11r- r     0-vmdw-dwv0-[o- rr0-32r2 0     r4i32r -rm32r3p=x ewifef-432fr32o3-==

I got these values by merely pecking randomly at the keyboard. But it's unlikely that these values would serve well as sample customer names. After all, names have some acceptable variation (such as Jim or Jane), and when you develop a table of test data, it's important that those values reflect possible data as much as possible. Your values should be from a set of known values, but should occur randomly within that set, without following a predictable pattern.

Another aspect to consider is the variance . Very few names are unique. For instance, in the United States, you will find frequent occurrences of the male names John, James, or Scott (but few cases of names like Arup). So when you generate test data, you usually want to make sure that the distribution is random but also that it follows a real-world statistical model. For instance, let's say that in our population the distribution of first names should look like the following:

10%	`Alan`
10%	`Barbara`
5%	`Charles`
5%	`David`
15%	`Ellen`
20%	`Frank`
10%	`George`
5%	`Hillary`
10%	`Iris`
10%	`Josh`

When populating the column FIRST_NAME in the table ACCOUNTS according to this pattern, therefore, 10% of the records should have the FIRST_NAME Alan, 10% Barbara, and so on. Let's further assume that we want to populate the rest of the columns as well to represent real-world scenarios. First let's see the columns of the modified ACCOUNTS table.

     SQL> DESC accounts      Name              Null?    Type      ----------------- -------- ------------      ACC_NO            NOT NULL NUMBER      FIRST_NAME        NOT NULL VARCHAR2(30)      LAST_NAME         NOT NULL VARCHAR2(30)      ACC_TYPE          NOT NULL VARCHAR2(1)      FOLIO_ID                   NUMBER      SUB_ACC_TYPE               VARCHAR2(30)      ACC_OPEN_DT       NOT NULL DATE      ACC_MOD_DT                 DATE      ACC_MGR_ID                 NUMBER

Suppose that to make the data distribution realistic, I have to create the data to follow the column pattern shown in the following table.

Column name	Purpose	Data pattern
ACC_NO	Account number	Any number with fewer than 10 digits
FIRST_NAME	First name	10% - Alan
		10% -Barbara
		5% - Charles
		5% - David
		15% - Ellen
		20% - Frank
		10% - George
		5% - Hillary
		10% - Iris
		10% - Josh
LAST_NAME	Last name	Any alphabetic character between 4 and 30, but 25% should be "Smith"
ACC_TYPE	Type of account: savings, checking, etc.	20% each of S, C, M, D, and X
FOLIO_ID	Folio ID from the other systems	One-half NULL and the other half a number related to the account number
SUB_ACC_TYPE	If the customer is incorporated, then sub-account types, if any	75% null. From the values populated:
		5% - S
		20% - C
ACC_OPEN_DT	Date account was opened	A date between now and 500 days ago
ACC_MGR_ID	ID of the account manager servicing the account	There are 5 account managers, with the percent of accounts as follows:
		1 - 40%
		2 - 10%
		3 - 10%
		4 - 10%
		5 - 30%

As you can see, these are moderately complex requirements; they accurately reflect how the data will be distributed in a real database. In real life, there will be customers with the first names "Josh" and "Ellen," not "XepqjEuF," so the names must be chosen from the set of possible names. In the United States, there are many different last names. Here, I want a semi-random distribution with 25% of a very popular last name, "Smith." The rest of the last names can be scattered.

So how can I generate data that conforms to this complex set of rules? I will borrow a page or two from probability theory and follow the path of statisticians, who often use the Monte Carlo simulation . With this approach, I generate a random number, between 1 and 100 (inclusive). Over a long period of time, the probability that a specific number, such as 6, will be generated is exactly one out of 100, or 1%. In fact, all the numbers have 1/100 probability of being generated. Using the same approach, the probability of either of two numbers, such as 1 and 2, will be 2%; and that of any of three numbers1, 2, or 3will be 3%. And, of course, the probability that any one of the numbers between 1 and 10 will turn up is 10%. I will use these rules to configure the probability of the random value generated.

7.4.1. Generating Strings

Take, for instance, the value of the column ACC_TYPE, which calls for equal probability of S, C, M, D, and X, or 20% probability each. If we generate a whole number between 1 and 5 (inclusive), the probability of each generated number will be 20%. Then I can use a SQL DECODE function to get the ACC_TYPE value based on the number generated.

       1  SELECT DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 6)),       2                 1, 'S',       3                 2, 'C',       4                 3, 'M',       5                 4, 'D',       6                 'X'       7                )       8*   FROM DUAL;

Let's see what's going on here. First, I am generating a number between 1 and 5 (line 1). Because the number generated is less than the highest value passed as a parameter, I have specified 6. Because I want a whole number, I use the FLOOR function in line 1. It truncates all decimal values from the generated number. Depending on the number obtained, I use DECODE to get one of the valuesS, C, M, D, or X. Because the numbers 1, 2, 3, 4, and 5 all have an equal probability of getting generated, so will the lettersat 20% each.

This technique is very useful in generating random but predetermined values of the type shown in the example. The same approach can be used to generate almost all types of predetermined random values.

7.4.2. Generating Random Values with NULLs

Remember from the discussion earlier that the requirement for the column FOLIO_ID is a little different. It needs only 50% of the values populated; the rest should be NULL. How can I achieve that result?

I will use the same probability approach with a twist; I will simply use a determination of "yes" or "no." Generating a random number between 1 and 100 will ensure 1% probability of each number. Hence, a number below 51 will have exactly 50% probability. I can use this in a CASE statement to get the value:

     SQL> SELECT CASE       2            WHEN DBMS_RANDOM.VALUE (1, 100) < 51       3               THEN NULL       4            ELSE FLOOR (DBMS_RANDOM.VALUE (1, 100))       5         END       6* FROM DUAL;

On line 2, I check if the number generated is less than 51. If so, I return NULL. Because the probability of a number under 51 is 50%, I have NULLs occurring 50% of the time, as well. In the other 50%, I have generated a random number between 1 and 100 to be used as a FOLIO_ID.

7.4.3. Generating Random Strings of Random Length

In DMBS_RANDOM.STRING, the string generated is random but of fixed length. In reality, however, people have last names of varying lengths. In this next example, the requirement is to have a length between 4 and 30 characters. To accommodate this requirement, I can pass the length as a random number as well to the STRING function in line 6 below.

       1  BEGIN       2     FOR i IN 1 .. 10       3     LOOP       4        DBMS_OUTPUT.put_line (   'Random String='       5                  || DBMS_RANDOM.STRING ('A'       6                                        , DBMS_RANDOM.VALUE (4, 30)       7                                                    )       8                             );       9     END LOOP;      10* END;

The output is:

     Random String=RniQZGquFVJYFpGLOvtNd     Random String=GhcphpcsaCXlhigRQY     Random String=JtakoelUf     Random String=BgCOu     Random String=QFBzQxcHqGlHWkZFmnN     Random String=lSxVjqJvpwBB     Random String=jfhNARzALrLOKZRpOwnhrzz     Random String=KuFtdJcqQpjkrFmzFbzcXnYFGjWo     Random String=BhuZ     Random String=GebcqcgvzBfEpTYnJPmYAQdb

Note how the generated strings are not only random but also of randomly different lengths.

I also need 25% of the last names to be "Smith" and the rest to have random values. I can accomplish that by combining the random strings and the Monte Carlo simulation as follows:

        DECODE (              FLOOR(DBMS_RANDOM.value(1,5)),                 1,'Smith',                 DBMS_RANDOM.string ('A',DBMS_RANDOM.value(4,30))           )

This expression will return "Smith" 25% of the time and a random alphabetic string between 4 and 30 characters the rest of the time.

7.4.4. Putting It All Together

Now that you understand the building blocks of the randomization approach, you can put them together to build the account-record-generation PL/SQL code segment shown below. In this example, I load 100,000 records into the table ACCOUNTS. Here is the complete loading program.

     /* File on web: ins_acc.sql */     BEGIN        FOR l_acc_no IN 1 .. 100000        LOOP           INSERT INTO accounts                VALUES (l_acc_no,                        -- First Name                        DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 21)),                                1, 'Alan',                                2, 'Alan',                                3, 'Barbara',                                4, 'Barbara',                                5, 'Charles',                                6, 'David',                                7, 'Ellen',                                8, 'Ellen',                                9, 'Ellen',                                10, 'Frank',                                11, 'Frank',                                12, 'Frank',                                13, 'George',                                14, 'George',                                15, 'George',                                16, 'Hillary',                                17, 'Iris',                                18, 'Iris',                                19, 'Josh',                                20, 'Josh',                                'XXX'                               ),                        -- Last Name                        DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 5)),                                1, 'Smith',                                DBMS_RANDOM.STRING ('A'                                                  , DBMS_RANDOM.VALUE (4, 30))                               ),                        -- Account Type                        DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 5)),                                1, 'S',                                2, 'C',                                3, 'M',                                4, 'D',                                'X'                               ),                        -- Folio ID                        CASE                           WHEN DBMS_RANDOM.VALUE (1, 100) < 51                              THEN NULL                           ELSE l_acc_no + FLOOR (DBMS_RANDOM.VALUE (1, 100))                        END,                        -- Sub Acc Type                        CASE                           WHEN DBMS_RANDOM.VALUE (1, 100) < 76                              THEN NULL                           ELSE DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 6)),                                        1, 'S',                                        2, 'C',                                        3, 'C',                                        4, 'C',                                        5, 'C',                                        NULL                                       )                        END,                        -- Acc Opening Date                        SYSDATE - DBMS_RANDOM.VALUE (1, 500),                        -- Acc Mod Date                        SYSDATE,                        -- Account Manager ID                        DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 11)),                                1, 1,                                2, 1,                                3, 1,                                4, 1,                                5, 2,                                6, 3,                                7, 4,                                8, 5,                                9, 5,                                10, 5,                                0                               ));        END LOOP;        COMMIT;     END;

How do I know that all these exercises yield the required result? The proof is in the pudding. After this table has been loaded, let's examine the actual distribution:

     SQL> SELECT   first_name, COUNT (*)       2      FROM accounts       3  GROUP BY first_name       4  ORDER BY first_name       5  /     FIRST_NAME                       COUNT(*)     ------------------------------ ----------     Alan                                 9766     Barbara                             10190     Charles                              5066     David                                5000     Ellen                               15023     Frank                               15109     George                              14913     Hillary                              5019     Iris                                 9932     Josh                                 9982     10 rows selected.

Each first name is exactly as per the desired distribution. For example, remember that I wanted to have 10% of the rows contain the first name "Alan," and I got 9,766 out of 100,000, which equates to approximately 10%? I wanted 15% containing "Ellen" and I got 15.023%, pretty close to that number and statistically significant. You can continue to check through all of the other columns, verifying that they are actually distributed as per the required pattern.