Approaches to Generating Data

for RuBoard

The two most important ingredients in making test data are the uniqueness/randomness of the data and the time it takes to produce. Random data is usually better, and shorter creation times are better. What we want to do is produce data that is as random or unique as possible in as short a time as possible. These two variables are usually inversely proportional to one another. That is, you increase one at the expense of the other. Increasing data randomness also increases the time it takes to produce. Similarly, improving the time it takes to generate test data usually lessens its uniqueness.

Although the examples we'll cover typically only generate 100 rows of data, you can extend them to generate practically any number of rows. I've purposely avoided showing you the obvious technique of looping on a variable and inserting a single row with each iteration because that approach is slow and resource- intensive . Each row inserted creates another entry in the transaction log. The technique is impractical for generating extremely large numbers of rows because it's too slow and because of its negative effect on the transaction log.

Cross Join

The first approach I'll show you is the cross-join approach. It involves synthesizing data by leveraging SQL's ability to return a Cartesian product (the result of multiplying one table by another). Here's an example:

 CREATE TABLE #list (id int identity) INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES  SELECT * FROM #list L1 CROSS JOIN #list L2  GO DROP TABLE #list 

(Results abridged)

 id          id ----------- ----------- 1           1 2           1 3           1 4           1 5           1 6           1 ... 6           10 7           10 8           10 9           10 10          10 (100 row(s) affected) 

The key line is the SELECT itself (in bold type). It's assumed that the table from which we're selecting would normally already exist.

You can increase the number of CROSS JOINs to multiply the number of rows generated. For example, to generate 1,000 rows instead of 100 in the previous example, we'd need only to set up another CROSS JOIN to the #List table. Through the power of exponential multiplication, you can quickly generate large volumes of data this way.

This technique works great for returning the data in a particular table, but what if we want to generate something besides what's already in a table? Here's a variation on the cross-join technique that returns completely new data:

 CREATE TABLE #list (id int identity) INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES SELECT IDENTITY(int, 1,1) AS Id, RAND(L1.id) AS RNum INTO #list2 FROM #list L1 CROSS JOIN #list L2 SELECT * FROM #list2 GO DROP TABLE #list,#list2 

(Results abridged)

 Id          RNum ----------- ----------------------------------------------------- 1           0.71359199321292355 2           0.71359199321292355 3           0.71359199321292355 4           0.71359199321292355 5           0.71359199321292355 6           0.71359199321292355 ... 96          0.71375968995424732 97          0.71375968995424732 98          0.71375968995424732 99          0.71375968995424732 100         0.71375968995424732 (100 row(s) affected) 

Here, we use the IDENTITY() function to generate a sequential integer for the first column. Note the semirandomness of the RNum column. Obviously, it's not quite random, but we get a good mix of values by using the RAND() function and passing it a column from the CROSS JOIN as a seed. Can we do better? Absolutely. Here's a variation that mixes the data a bit better:

 CREATE TABLE #list (id int identity) INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES SELECT IDENTITY(int, 1,1) AS Id, RAND(CASE WHEN L1.id % 2 = 0 THEN L1.id ELSE L2.id END) AS RNum INTO #list2 FROM #list L1 CROSS JOIN #list L2 SELECT * FROM #list2 GO DROP TABLE #list,#list2 

(Results abridged)

 Id          RNum ----------- ----------------------------------------------------- 1           0.71359199321292355 2           0.7136106261841817 3           0.71359199321292355 4           0.7136478921266981 5           0.71359199321292355 6           0.71368515806921451 ... 96          0.71368515806921451 97          0.71375968995424732 98          0.71372242401173092 99          0.71375968995424732 100         0.71375968995424732 (100 row(s) affected) 

As you can see, the data still isn't really random, but it's at least mixed up a little more randomly . Can we do better? Can we generate data that's much more random in a timely fashion? Certainly. Have a look at this variation:

 CREATE TABLE #list (id int identity) INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES SELECT IDENTITY(int, 1,1) AS Id INTO #list2 FROM #list L1 CROSS JOIN #list L2 SELECT Id, RAND(Id)*10000000000000 RNum FROM #list2 GO DROP TABLE #list,#list2 

(Results)

 Id          RNum ----------- ----------------------------------------------------- 1           7135919932129.2354 2           7136106261841.8174 3           7136292591554.3994 4           7136478921266.9814 5           7136665250979.5635 6           7136851580692.1455 ... 96          7153621254824.5244 97          7153807584537.1074 98          7153993914249.6885 99          7154180243962.2715 100         7154366573674.8525 (100 row(s) affected) 

Here we use the sequential number we generated using IDENTITY() as a seed to the RAND() function. This, of course, provides a unique seed for each invocation of the function and results in a random number for each row in the temporary table. Additionally, we multiply this random number by 10 trillion to return a positive real number between 1 and 10 trillion.

The only problem with this approach is that the random numbers generated are in order. For example, you could change the query to sort on the RNum column, like this:

 SELECT Id, RAND(Id)*10000000000000 RNum FROM #list2 ORDER BY RNum 

However, the order of the rows in the result set wouldn't change. That's a typical weakness of random number generators. Because they use a formula to generate numbers, they're susceptible to generating detectible patterns when supplied identical inputs or inputs that have a similar relationship to one another. In this case, the inputs have the same relationship to one another: Each Id value is one greater than the value before it in the table. One potential way around this is to create your own random number generator. We'll explore that possibility in the next section.

Random()

In addition to SQL Server's built-in functions, you can use UDFs to assist with generating random data. Here's a variation of the previous example that does just that:

 USE tempdb GO SET NOCOUNT ON GO DROP FUNCTION dbo.Random GO CREATE FUNCTION dbo.Random(@Seed int) RETURNS int AS /* Based on an algorithm from The Art of Computer Programming, Volume 2, Seminumerical Algorithms, by Donald Knuth, pp. 185-6. */ BEGIN   DECLARE @MM int, @AA int, @QQ int, @RR int, @MMM int, @AAA int,   @QQQ int, @RRR int, @Result int, @X decimal(38,0), @Y decimal(38,0)   SELECT @MM=2147483647, @AA=48271, @QQ=44488,      @RR=3399, @MMM=2147483399, @AAA=40692, @QQQ=52774, @RRR=3791   SET @X=@AA*(@Seed%@QQ)-@RR*CAST((@Seed/@QQ) AS int)   IF (@X<0) SET @X=@X+@MM   SET @Y=@AAA*(@Seed%@QQQ)-@RRR*CAST((@Seed/@QQQ) AS int)   IF (@Y<0) SET @Y=@Y+@MMM   SET @Result=@X-@Y   IF (@Result<=0) SET @Result=@Result+@MM   RETURN(@Result) END GO DROP FUNCTION dbo.ScrambleFloat GO CREATE FUNCTION dbo.ScrambleFloat(@Float float, @Seed float) RETURNS float AS BEGIN   DECLARE @VFloat as varbinary(8), @VFSFloat int, @Return float   SET @VFloat=CAST(@Float as varbinary(8))   SET @Return=dbo.Random(@Seed)*          ((CAST(CAST(SUBSTRING(@VFloat,5,1) AS int) AS float) *          (CAST(SUBSTRING(@VFloat,7,1) AS int)) /          ISNULL(NULLIF(CAST(SUBSTRING(@VFloat,6,1) AS int),0),1)) +          CAST(SUBSTRING(@VFloat,8,1) AS int) -          (CAST(SUBSTRING(@VFloat,3,1) AS int) %          ISNULL(NULLIF(CAST(SUBSTRING(@VFloat,1,1) AS int),0),1))+          (CAST(SUBSTRING(@VFloat,2,1) AS int) *          CASE WHEN            CAST(SUBSTRING(@VFloat,4,1) AS int) % 2 = 0 THEN 1 ELSE -1 END))   RETURN(@Return) END GO CREATE TABLE #list (id int identity) INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES SELECT IDENTITY(int, 1,1) AS Id INTO #list2 FROM #list L1 CROSS JOIN #list L2 SELECT Id, dbo.Random(Id) As Random, dbo.ScrambleFloat(Id, DATEPART(ms, GETDATE())) AS Scrambled FROM #list2 ORDER BY Scrambled GO DROP TABLE #list,#list2 

(Results)

 Id          Random      Scrambled ----------- ----------- ----------------------------------------------------- 29          219791      -722564027.05882347 44          333476      -673420171.23333335 13          98527       -574568537.4000001 28          212212      -544849779.53631282 74          560846      -523678459.07142854 98          742742      -507777842.00000006 96          727584      -477779538.77049184 ... 79          598741      3028523768.1111112 41          310739      3259160366.647059 25          189475      3645254956.2000003 81          613899      3701346966.7777781 53          401687      4101890347.4999995 27          204633      4471710800.7000008 55          416845      4934363189.25 

This code has several interesting features. First, note that we appear to get a random ordering of rows (it's not actually random, but at least it doesn't match the original order of the Id column). Second, note the use of the Random() UDF. We use it both in the SCRAMBLEFLOAT() function as well as to return a column in the result set. It uses an algorithm from Donald Knuth's three-volume work, The Art of Computer Programming , [2] to generate a random integer. We use it in SCRAMBLEFLOAT() because UDFs cannot use nondeterministic functions such as Transact -SQL's built-in RAND().

[2] Knuth, Donald. The Art of Computer Programming , Volume 2 , Seminumerical Algorithms . Reading, MA: Addison-Wesley, 1998. pp. 185186.

The SCRAMBLEFLOAT() function is also worth discussing. It literally takes a floating-point value apart and synthesizes a new value from it, using a seed value to help further randomize the process. It first casts the input value to a varbinary(8), then uses SUBSTRING() to disassemble it and turn it into a completely different floating-point value. It also randomizes the sign of the number so that the sign will vary somewhat from input value to input value. Although the values it returns should vary from run to run, their relationship to the input value and to the seed will not vary because a deterministic computation is used to transform the input value into the function result.

Run this code multiple times. What you should see is that the actual ordering of the rows does not change between runs, although the values returned by the SCRAMBLEFLOAT() function do. Even though the ordering doesn't change, it at least differs from the default ordering of the Id column, which is probably good enough for test data. That the ordering doesn't change is a result of the fact that the column on which we're ordering, although it appears to consist of random values, actually doesn't. SCRAMBLEFLOAT(), for all its machinations, returns predictable values. In other words, given a specific input value, SCRAMBLEFLOAT() will always return a value with a predictable relationship to the input value.

Doubling

Doubling inserting a table into itself until a desired size is reachedbears some similarities to the cross-tab techniques we've been examining. You'll find that this technique can produce fairly random data in a reasonable amount of time. Here's an example:

 SET NOCOUNT ON CREATE TABLE #list (id int identity, PlaceHolder int NULL) INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES INSERT #list DEFAULT VALUES SET NOCOUNT OFF DECLARE @cnt int, @rcnt int, @targ int SELECT @targ=100, @cnt=COUNT(*), @rcnt=@targ-COUNT(*) FROM #list WHILE @rcnt>0 BEGIN   SET ROWCOUNT @rcnt   INSERT #list (PlaceHolder) SELECT PlaceHolder FROM #list   SET @cnt=@cnt+@@ROWCOUNT   SET @rcnt=@targ-@cnt END SET ROWCOUNT 0 SELECT Id, RAND(Id)*10000000000000 FROM #list GO DROP TABLE #list 

(Results abridged)

 (10 row(s) affected) (20 row(s) affected) (40 row(s) affected) (20 row(s) affected) Id ----------- ----------------------------------------------------- 1           7135919932129.2354 2           7136106261841.8174 3           7136292591554.3994 4           7136478921266.9814 5           7136665250979.5635 6           7136851580692.1455 ... 96          7153621254824.5244 97          7153807584537.1074 98          7153993914249.6885 99          7154180243962.2715 100         7154366573674.8525 (100 row(s) affected) 

Here we begin by setting @targ to the target number of rows, and @cnt and @rcnt to the number of rows in the table and the number of rows remaining to insert respectively. We then loop, inserting the table into itself, until we've reached the target number of rows.

Note the use of SET ROWCOUNT to govern the number of rows inserted into the table. This keeps us from going beyond the target row count. We have to use SET ROWCOUNT here instead of SELECT TOP because SELECT TOP does not allow a variable to be used for the number of rows to return.

INSERT EXEC

Another technique for generating lots of data relatively quickly is to use INSERT EXEC to call a stored procedure that returns a large number of rows (even containing data that we don't care about using). Here's a query that first calls the SQLDIAG utility to create a report of your vital SQL Server statistics, then calls xp_cmdshell via INSERT EXEC to generate a series of sequential integers:

 SET NOCOUNT ON SET ANSI_WARNINGS OFF CREATE TABLE #list (id int identity, PlaceHolder char(1) NULL) EXEC master..xp_cmdshell 'sqldiag.exe -Oc:\temp\sqldiag.rpt',no_output SET ROWCOUNT 100 INSERT #list (PlaceHolder) EXEC master..xp_cmdshell 'TYPE c:\temp\sqldiag.rpt' SET ROWCOUNT 0 SELECT Id FROM #list GO DROP TABLE #list 

(Results abridged)

 Id ----------- 1 2 3 4 5 6 ... 96 97 98 99 100 

Here we use xp_cmdshell to list a text file using TYPE (we also create the file in advance using the SQLDIAG utility, but, normally, you wouldn't regenerate the file each time). We disable ANSI_WARNINGS because we're intentionally truncating the rows we're returning from the text file. It's a generation mechanism only, and we don't care about its contents. The PlaceHolder column is exactly what its name impliesa placeholder. It exists in the table so that we can insert a single character into it for each row returned by the stored procedure. Simultaneous with PlaceHolder getting populated , the table's Identity column will be generated, which is actually what we're interested in. We're using the call to xp_cmdshell to generate a series of rows very quickly.

Note the use of SET ROWCOUNT to limit the number of rows returned. SET ROWCOUNT affects INSERT EXEC just as it affects INSERT SELECT.

Naturally, calling the command shell via xp_cmdshell has performance overhead associated with it. The actual procedure called isn't important, though, as you'll see in just a moment. What's important is that you can quickly generate large amounts of data by calling a stored procedure via INSERT EXEC. Because you can't pass a table name into a procedure and insert data into it (without generating dynamic T-SQL, of course), the INSERT EXEC technique is a handy way to synthesize large data volumes quickly.

sp_generate_test_data

Sp_generate_test_data, listed in the example code that follows , allows us to insert any number of rows we want without needing a text file or other external data source. It takes a single parameter specifying the number of rows to generate, then returns them as a result set. These rows can then be inserted using INSERT EXEC. Here's the code:

 SET NOCOUNT ON USE master GO IF OBJECT_ID('dbo.sp_generate_test_data') IS NOT NULL   DROP PROC dbo.sp_generate_test_data GO CREATE PROC dbo.sp_generate_test_data @rowcount int AS DECLARE @var int DECLARE @table TABLE (Id int identity) SET @var=0 WHILE @var<@rowcount BEGIN   INSERT @table DEFAULT VALUES   SET @var=@var+1 END SELECT * FROM @table GO CREATE TABLE #list (id int) INSERT #list (Id) EXEC sp_generate_test_data 10 SELECT Id FROM #list GO DROP TABLE #list 

(Results)

 Id ----------- 1 2 3 4 5 6 7 8 9 10 

Here we use a stored procedure, sp_generate_test_data, to generate a series of integers, then insert those integers into a table. We specify the number of rows to generate as a parameter to the procedure. The procedure merely inserts the number of rows specified into a table variable, then returns the rows in the table variable.

This technique works great if you want to insert a sequential list of integers into a table, but what if you don't? What if you want to insert data that's fairly randomized and you don't need a sequential set of integers? Do you have to write a new sp_generate_test_data procedure for each type of data you want to return? You can probably avoid that with this technique:

 CREATE TABLE #list (PlaceHolder int, Random float DEFAULT (RAND()), Today datetime DEFAULT (GETDATE()) ) INSERT #list (PlaceHolder) EXEC sp_generate_test_data 10 SELECT Random, Today FROM #list 

(Results)

 Random                                             Today -------------------------------------------------- --------------------- 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 0.0629477210736374                                 2001-08-05 18:07:03.607 

As you can see, we generated a set of test data that doesn't actually use the integer returned by sp_generate_test_data. The rows generated aren't random, but at least we didn't have to set up a special query or stored procedure to create it.

Although most of the other techniques I've presented thus far will outperform the INSERT EXEC technique, it's more convenient than most of them. Using it can be as simple as constructing a basic INSERT statement and supplying the number of rows you want to generate via a parameter to sp_generate_test_data.

Table 6-1. Data Generation Timings (in milliseconds )
Rows WHILE loop CROSS JOIN CROSS JOIN RAND() RANDOM()/ SCRAMBLEFLOAT() Doubling INSERT EXEC
100 140 170 203 576 106 170
10,000 1393 173 213 4970 533 1670
100,000 13780 783 1313 48873 4716 16330
for RuBoard


The Guru[ap]s Guide to SQL Server[tm] Stored Procedures, XML, and HTML
The Guru[ap]s Guide to SQL Server[tm] Stored Procedures, XML, and HTML
ISBN: 201700468
EAN: N/A
Year: 2005
Pages: 223

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net