7.4. Following Statistical PatternsSuppose that you are building a table of customers and populating it with names that will be used later in searches. How would you generate the values? For example, should the values be random collections of characters like the following? -32nr -32nr3121ne -e21e 323-=11r- r 0-vmdw-dwv0-[o- rr0-32r2 0 r4i32r -rm32r3p=x ewifef-432fr32o3-== I got these values by merely pecking randomly at the keyboard. But it's unlikely that these values would serve well as sample customer names. After all, names have some acceptable variation (such as Jim or Jane), and when you develop a table of test data, it's important that those values reflect possible data as much as possible. Your values should be from a set of known values, but should occur randomly within that set, without following a predictable pattern. Another aspect to consider is the variance . Very few names are unique. For instance, in the United States, you will find frequent occurrences of the male names John, James, or Scott (but few cases of names like Arup). So when you generate test data, you usually want to make sure that the distribution is random but also that it follows a real-world statistical model. For instance, let's say that in our population the distribution of first names should look like the following:
When populating the column FIRST_NAME in the table ACCOUNTS according to this pattern, therefore, 10% of the records should have the FIRST_NAME Alan, 10% Barbara, and so on. Let's further assume that we want to populate the rest of the columns as well to represent real-world scenarios. First let's see the columns of the modified ACCOUNTS table. SQL> DESC accounts Name Null? Type ----------------- -------- ------------ ACC_NO NOT NULL NUMBER FIRST_NAME NOT NULL VARCHAR2(30) LAST_NAME NOT NULL VARCHAR2(30) ACC_TYPE NOT NULL VARCHAR2(1) FOLIO_ID NUMBER SUB_ACC_TYPE VARCHAR2(30) ACC_OPEN_DT NOT NULL DATE ACC_MOD_DT DATE ACC_MGR_ID NUMBER Suppose that to make the data distribution realistic, I have to create the data to follow the column pattern shown in the following table.
As you can see, these are moderately complex requirements; they accurately reflect how the data will be distributed in a real database. In real life, there will be customers with the first names "Josh" and "Ellen," not "XepqjEuF," so the names must be chosen from the set of possible names. In the United States, there are many different last names. Here, I want a semi-random distribution with 25% of a very popular last name, "Smith." The rest of the last names can be scattered. So how can I generate data that conforms to this complex set of rules? I will borrow a page or two from probability theory and follow the path of statisticians, who often use the Monte Carlo simulation . With this approach, I generate a random number, between 1 and 100 (inclusive). Over a long period of time, the probability that a specific number, such as 6, will be generated is exactly one out of 100, or 1%. In fact, all the numbers have 1/100 probability of being generated. Using the same approach, the probability of either of two numbers, such as 1 and 2, will be 2%; and that of any of three numbers1, 2, or 3will be 3%. And, of course, the probability that any one of the numbers between 1 and 10 will turn up is 10%. I will use these rules to configure the probability of the random value generated. 7.4.1. Generating StringsTake, for instance, the value of the column ACC_TYPE, which calls for equal probability of S, C, M, D, and X, or 20% probability each. If we generate a whole number between 1 and 5 (inclusive), the probability of each generated number will be 20%. Then I can use a SQL DECODE function to get the ACC_TYPE value based on the number generated. 1 SELECT DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 6)), 2 1, 'S', 3 2, 'C', 4 3, 'M', 5 4, 'D', 6 'X' 7 ) 8* FROM DUAL; Let's see what's going on here. First, I am generating a number between 1 and 5 (line 1). Because the number generated is less than the highest value passed as a parameter, I have specified 6. Because I want a whole number, I use the FLOOR function in line 1. It truncates all decimal values from the generated number. Depending on the number obtained, I use DECODE to get one of the valuesS, C, M, D, or X. Because the numbers 1, 2, 3, 4, and 5 all have an equal probability of getting generated, so will the lettersat 20% each. This technique is very useful in generating random but predetermined values of the type shown in the example. The same approach can be used to generate almost all types of predetermined random values. 7.4.2. Generating Random Values with NULLsRemember from the discussion earlier that the requirement for the column FOLIO_ID is a little different. It needs only 50% of the values populated; the rest should be NULL. How can I achieve that result? I will use the same probability approach with a twist; I will simply use a determination of "yes" or "no." Generating a random number between 1 and 100 will ensure 1% probability of each number. Hence, a number below 51 will have exactly 50% probability. I can use this in a CASE statement to get the value: SQL> SELECT CASE 2 WHEN DBMS_RANDOM.VALUE (1, 100) < 51 3 THEN NULL 4 ELSE FLOOR (DBMS_RANDOM.VALUE (1, 100)) 5 END 6* FROM DUAL; On line 2, I check if the number generated is less than 51. If so, I return NULL. Because the probability of a number under 51 is 50%, I have NULLs occurring 50% of the time, as well. In the other 50%, I have generated a random number between 1 and 100 to be used as a FOLIO_ID. 7.4.3. Generating Random Strings of Random LengthIn DMBS_RANDOM.STRING, the string generated is random but of fixed length. In reality, however, people have last names of varying lengths. In this next example, the requirement is to have a length between 4 and 30 characters. To accommodate this requirement, I can pass the length as a random number as well to the STRING function in line 6 below. 1 BEGIN 2 FOR i IN 1 .. 10 3 LOOP 4 DBMS_OUTPUT.put_line ( 'Random String=' 5 || DBMS_RANDOM.STRING ('A' 6 , DBMS_RANDOM.VALUE (4, 30) 7 ) 8 ); 9 END LOOP; 10* END; The output is: Random String=RniQZGquFVJYFpGLOvtNd Random String=GhcphpcsaCXlhigRQY Random String=JtakoelUf Random String=BgCOu Random String=QFBzQxcHqGlHWkZFmnN Random String=lSxVjqJvpwBB Random String=jfhNARzALrLOKZRpOwnhrzz Random String=KuFtdJcqQpjkrFmzFbzcXnYFGjWo Random String=BhuZ Random String=GebcqcgvzBfEpTYnJPmYAQdb Note how the generated strings are not only random but also of randomly different lengths. I also need 25% of the last names to be "Smith" and the rest to have random values. I can accomplish that by combining the random strings and the Monte Carlo simulation as follows: DECODE ( FLOOR(DBMS_RANDOM.value(1,5)), 1,'Smith', DBMS_RANDOM.string ('A',DBMS_RANDOM.value(4,30)) ) This expression will return "Smith" 25% of the time and a random alphabetic string between 4 and 30 characters the rest of the time. 7.4.4. Putting It All TogetherNow that you understand the building blocks of the randomization approach, you can put them together to build the account-record-generation PL/SQL code segment shown below. In this example, I load 100,000 records into the table ACCOUNTS. Here is the complete loading program. /* File on web: ins_acc.sql */ BEGIN FOR l_acc_no IN 1 .. 100000 LOOP INSERT INTO accounts VALUES (l_acc_no, -- First Name DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 21)), 1, 'Alan', 2, 'Alan', 3, 'Barbara', 4, 'Barbara', 5, 'Charles', 6, 'David', 7, 'Ellen', 8, 'Ellen', 9, 'Ellen', 10, 'Frank', 11, 'Frank', 12, 'Frank', 13, 'George', 14, 'George', 15, 'George', 16, 'Hillary', 17, 'Iris', 18, 'Iris', 19, 'Josh', 20, 'Josh', 'XXX' ), -- Last Name DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 5)), 1, 'Smith', DBMS_RANDOM.STRING ('A' , DBMS_RANDOM.VALUE (4, 30)) ), -- Account Type DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 5)), 1, 'S', 2, 'C', 3, 'M', 4, 'D', 'X' ), -- Folio ID CASE WHEN DBMS_RANDOM.VALUE (1, 100) < 51 THEN NULL ELSE l_acc_no + FLOOR (DBMS_RANDOM.VALUE (1, 100)) END, -- Sub Acc Type CASE WHEN DBMS_RANDOM.VALUE (1, 100) < 76 THEN NULL ELSE DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 6)), 1, 'S', 2, 'C', 3, 'C', 4, 'C', 5, 'C', NULL ) END, -- Acc Opening Date SYSDATE - DBMS_RANDOM.VALUE (1, 500), -- Acc Mod Date SYSDATE, -- Account Manager ID DECODE (FLOOR (DBMS_RANDOM.VALUE (1, 11)), 1, 1, 2, 1, 3, 1, 4, 1, 5, 2, 6, 3, 7, 4, 8, 5, 9, 5, 10, 5, 0 )); END LOOP; COMMIT; END; How do I know that all these exercises yield the required result? The proof is in the pudding. After this table has been loaded, let's examine the actual distribution: SQL> SELECT first_name, COUNT (*) 2 FROM accounts 3 GROUP BY first_name 4 ORDER BY first_name 5 / FIRST_NAME COUNT(*) ------------------------------ ---------- Alan 9766 Barbara 10190 Charles 5066 David 5000 Ellen 15023 Frank 15109 George 14913 Hillary 5019 Iris 9932 Josh 9982 10 rows selected. Each first name is exactly as per the desired distribution. For example, remember that I wanted to have 10% of the rows contain the first name "Alan," and I got 9,766 out of 100,000, which equates to approximately 10%? I wanted 15% containing "Ellen" and I got 15.023%, pretty close to that number and statistically significant. You can continue to check through all of the other columns, verifying that they are actually distributed as per the required pattern. |