Combining SAS Data Sets: Methods


Concatenating

Definition

Concatenating data sets is the combining of two or more data sets, one after the other, into a single data set. The number of observations in the new data set is the sum of the number of observations in the original data sets. The order of observations is sequential. All observations from the first data set are followed by all observations from the second data set, and so on.

In the simplest case, all input data sets contain the same variables. If the input data sets contain different variables, observations from one data set have missing values for variables defined only in other data sets. In either case, the variables in the new data set are the same as the variables in the old data sets.

Syntax

Use this form of the SET statement to concatenate data sets:

SET data-set(s) ;

where

data-set

specifies any valid SAS data set name .

For a complete description of the SET statement, see SAS Language Reference: Dictionary .

DATA Step Processing During Concatenation

Compilation phase

SAS reads the descriptor information of each data set that is named in the SET statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.

Execution - Step 1

SAS reads the first observation from the first data set into the program data vector. It processes the first observation and executes other statements in the DATA step. It then writes the contents of the program data vector to the new data set.

The SET statement does not reset the values in the program data vector to missing, except for variables whose value is calculated or assigned during the DATA step. Variables that are created by the DATA step are set to missing at the beginning of each iteration of the DATA step. Variables that are read from a data set are not.

Execution - Step 2

SAS continues to read one observation at a time from the first data set until it finds an end-of-file indicator. The values of the variables in the program data vector are then set to missing, and SAS begins reading observations from the second data set and so forth until it reads all observations from all data sets.

Example 1: Concatenation Using the DATA Step

In this example, each data set contains the variables Common and Number, and the observations are arranged in the order of the values of Common. Generally, you concatenate SAS data sets that have the same variables. In this case, each data set also contains a unique variable to show the effects of combining data sets more clearly. The following shows the ANIMAL and the PLANT input data sets in the library that is referenced by the libref EXAMPLE:

 ANIMAL                             PLANT    OBS  Common  Animal  Number        OBS  Common  Plant     Number     1     a     Ant       5            1     g     Grape       69     2     b     Bird                   2     h     Hazelnut    55     3     c     Cat      17            3     i     Indigo     4     d     Dog       9            4     j     Jicama      14     5     e     Eagle                  5     k     Kale         5     6     f     Frog     76            6     l     Lentil      77 

The following program uses a SET statement to concatenate the data sets and then prints the results:

 libname example '  SAS-data-library  ';    data example.concatenation;       set example.animal example.plant;    run;    proc print data=example.concatenation;       var Common Animal Plant Number;       title 'Data Set CONCATENATION';    run; 
Output 23.1: Concatenated Data Sets (DATA Step)
start example
 Data Set CONCATENATION 1               Obs   Common   Animal   Plant      Number                 1     a      Ant                    5                 2     b      Bird                   .                 3     c      Cat                   17                 4     d      Dog                    9                 5     e      Eagle                  .                 6     f      Frog                  76                 7     g               Grape        69                 8     h               Hazelnut     55                 9     i               Indigo        .                10     j               Jicama       14                11     k               Kale          5                12     l               Lentil       77 
end example
 

The resulting data set CONCATENATION has 12 observations, which is the sum of the observations from the combined data sets. The program data vector contains all variables from all data sets. The values of variables found in one data set but not in another are set to missing.

Example 2: Concatenation Using SQL

You can also use the SQL language to concatenate tables. In this example, SQL reads each row in both tables and creates a new table named COMBINED. The following shows the YEAR1 and YEAR2 input tables:

 YEAR1                      YEAR2    Date1                      Date2    1996    1997                       1997    1998                       1998    1999                       1999                               2000                               2001 

The following SQL code creates and prints the table COMBINED.

 proc sql;       title 'SQL Table COMBINED';       create table combined as          select * from year1          outer union corr          select * from year2;          select * from combined;    quit; 
Output 23.2: Concatenated Tables (SQL)
start example
 SQL Table COMBINED                         1                                              Year                                          --------                                              1996                                              1997                                              1998                                              1999                                              1997                                              1998                                              1999                                              2000                                              2001 
end example
 

Appending Files

Instead of concatenating data sets or tables, you can append them and produce the same results as concatenation. SAS concatenates data sets (DATA step) and tables (SQL) by reading each row of data to create a new file. To avoid reading all the records, you can append the second file to the first file by using the APPEND procedure:

 proc append base=year1 data=year2;   run; 

The YEAR1 file will contain all rows from both tables.

Note  

You cannot use PROC APPEND to add observations to a SAS data set in a sequential library.

Efficiency

If no additional processing is necessary, using PROC APPEND or the APPEND statement in PROC DATASETS is more efficient than using a DATA step to concatenate data sets.

Interleaving

Definition

Interleaving uses a SET statement and a BY statement to combine multiple data sets into one new data set. The number of observations in the new data set is the sum of the number of observations from the original data sets. However, the observations in the new data set are arranged by the values of the BY variable or variables and, within each BY group , by the order of the data sets in which they occur. You can interleave data sets either by using a BY variable or by using an index.

Syntax

Use this form of the SET statement to interleave data sets when you use a BY variable:

SET data-set(s) ;

BY variable(s) ;

where

data-set

specifies a one-level name, a two-level name, or one of the special SAS data set names .

variable

specifies each variable by which the data set is sorted. These variables are referred to as BY variables for the current DATA or PROC step.

Use this form of the SET statement to interleave data sets when you use an index:

SET data-set-1 data-set-n KEY= index ;

where

data-set

specifies a one-level name, a two-level name, or one of the special SAS data set names.

index

provides nonsequential access to observations in a SAS data set, which are based on the value of an index variable or key.

For a complete description of the SET statement, including SET with the KEY= option, see SAS Language Reference: Dictionary .

Sort Requirements

Before you can interleave data sets, the observations must be sorted or grouped by the same variable or variables that you use in the BY statement, or you must have an appropriate index for the data sets.

DATA Step Processing During Interleaving

Compilation phase

  • SAS reads the descriptor information of each data set that is named in the SET statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.

  • SAS creates the FIRST. variable and LAST. variable for each variable listed in the BY statement.

Execution - Step 1

SAS compares the first observation from each data set that is named in the SET statement to determine which BY group should appear first in the new data set. It reads all observations from the first BY group from the selected data set. If this BY group appears in more than one data set, it reads from the data sets in the order in which they appear in the SET statement. The values of the variables in the program data vector are set to missing each time SAS starts to read a new data set and when the BY group changes.

Execution - Step 2

SAS compares the next observations from each data set to determine the next BY group and then starts reading observations from the selected data set in the SET statement that contains observations for this BY group. SAS continues until it has read all observations from all data sets.

Example 1: Interleaving in the Simplest Case

In this example, each data set contains the BY variable Common, and the observations are arranged in order of the values of the BY variable. The following shows the ANIMAL and the PLANT input data sets in the library that is referenced by the libref EXAMPLE:

 ANIMAL                     PLANT          OBS  Common  Animal      OBS Common  Plant           1     a     Ant          1    a     Apple           2     b     Bird         2    b     Banana           3     c     Cat          3    c     Coconut           4     d     Dog          4    d     Dewberry           5     e     Eagle        5    e     Eggplant           6     f     Frog         6    f     Fig 

The following program uses SET and BY statements to interleave the data sets, and prints the results:

 data example.interleaving;       set example.animal example.plant;       by Common;    run;    proc print data=example.interleaving;       title 'Data Set INTERLEAVING';    run; 
Output 23.3: Interleaved Data Sets
start example
 Data Set INTERLEAVING                      1                                Obs    common    animal    plant                                  1      a       Ant                                  2      a                 Apple                                  3      b       Bird                                  4      b                 Banana                                  5      c       Cat                                  6      c                 Coconut                                  7      d       Dog                                  8      d                 Dewberry                                  9      e       Eagle                                  10     e                 Eggplant                                  11     f       Frog                                  12     f                 Fig 
end example
 

The resulting data set INTERLEAVING has 12 observations, which is the sum of the observations from the combined data sets. The new data set contains all variables from both data sets. The value of variables found in one data set but not in the other are set to missing, and the observations are arranged by the values of the BY variable.

Example 2: Interleaving with Duplicate Values of the BY variable

If the data sets contain duplicate values of the BY variables, the observations are written to the new data set in the order in which they occur in the original data sets. This example contains duplicate values of the BY variable Common. The following shows the ANIMAL1 and PLANT1 input data sets:

 ANIMAL1                      PLANT1   OBS  Common  Animal1         OBS  Common  Plant1    1     a      Ant             1     a     Apple    2     a      Ape             2     b     Banana    3     b      Bird            3     c     Coconut    4     c      Cat             4     c     Celery    5     d      Dog             5     d     Dewberry    6     e      Eagle           6     e     Eggplant 

The following program uses SET and BY statements to interleave the data sets, and prints the results:

 data example.interleaving2;       set example.animal1 example.plant1;       by Common;    run;    proc print data=example.interleaving2;       title 'Data Set INTERLEAVING2: Duplicate BY Values';    run; 
Output 23.4: Interleaved Data Sets with Duplicate Values of the BY Variable
start example
 Data Set INTERLEAVING2: Duplicate BY Values                1                              Obs    Common    Animal1    Plant1                                1      a       Ant                                2      a       Ape                                3      a                  Apple                                4      b       Bird                                5      b                  Banana                                6      c       Cat                                7      c                  Coconut                                8      c                  Celery                                9      d       Dog                                10     d                  Dewberry                                11     e       Eagle                                12     e                  Eggplant 
end example
 

The number of observations in the new data set is the sum of the observations in all the data sets. The observations are written to the new data set in the order in which they occur in the original data sets.

Example 3: Interleaving with Different BY Values in Each Data Set

The data sets ANIMAL2 and PLANT2 both contain BY values that are present in one data set but not in the other. The following shows the ANIMAL2 and the PLANT2 input data sets:

 ANIMAL2                      PLANT2   OBS  Common  Animal2        OBS  Common  Plant2    1     a      Ant            1     a     Apple    2     c      Cat            2     b     Banana    3     d      Dog            3     c     Coconut    4     e      Eagle          4     e     Eggplant    5     f      Fig 

This program uses SET and BY statements to interleave these data sets, and prints the results:

 data example.interleaving3;      set example.animal2 example.plant2;      by Common;   run;   proc print data=example.interleaving3;      title 'Data Set INTERLEAVING3: Different BY Values';   run; 
Output 23.5: Interleaving Data Sets with Different BY Values
start example
 Data Set INTERLEAVING3: Different BY Values 1                    Obs    Common    Animal2    Plant2                     1       a       Ant                     2       a                  Apple                     3       b                  Banana                     4       c       Cat                     5       c                  Coconut                     6       d       Dog                     7       e       Eagle                     8       e                  Eggplant                     9       f                  Fig 
end example
 

The resulting data set has nine observations arranged by the values of the BY variable.

Comments and Comparisons

  • In other languages, the term merge is often used to mean interleave . SAS reserves the term merge for the operation in which observations from two or more data sets are combined into one observation. The observations in interleaved data sets are not combined; they are copied from the original data sets in the order of the values of the BY variable.

  • If one table has multiple rows with the same BY value, the DATA step preserves the order of those rows in the result.

  • To use the DATA step, the input tables must be appropriately sorted or indexed. SQL does not require the input tables to be in order.

One-to-One Reading

Definition

One-to-one reading combines observations from two or more data sets into one observation by using two or more SET statements to read observations independently from each data set. This process is also called one-to-one matching . The new data set contains all the variables from all the input data sets. The number of observations in the new data set is the number of observations in the smallest original data set. If the data sets contain common variables, the values that are read in from the last data set replace the values that were read in from earlier data sets.

Syntax

Use this form of the SET statement for one-to-one reading:

SET data-set-1 ;

SET data-set-2 ;

where

data-set-1

specifies a one-level name, a two-level name, or one of the special SAS data set names. data-set-1 is the first file that the DATA step reads.

data-set-2

specifies a one-level name, a two-level name, or one of the special SAS data set names. data-set-2 is the second file that the DATA step reads.

Caution  

Use care when you combine data sets with multiple SET statements. Using multiple SET statements to combine observations can produce undesirable results. Test your program on representative samples of the data sets before using this method to combine them.

For a complete description of the SET statement, see SAS Language Reference: Dictionary .

DATA Step Processing During a One-to-One Reading

Compilation phase

SAS reads the descriptor information of each data set named in the SET statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.

Execution - Step 1

When SAS executes the first SET statement, SAS reads the first observation from the first data set into the program data vector. The second SET statement reads the first observation from the second data set into the program data vector. If both data sets contain the same variables, the values from the second data set replace the values from the first data set, even if the value is missing. After reading the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. The SET statement does not reset the values in the program data vector to missing, except for those variables that were created or assigned values during the DATA step.

Execution - Step 2

SAS continues reading from one data set and then the other until it detects an end-of-file indicator in one of the data sets. SAS stops processing with the last observation of the shortest data set and does not read the remaining observations from the longer data set.

Example 1: One-to-One Reading: Processing an Equal Number of Observations

The SAS data sets ANIMAL and PLANT both contain the variable Common, and are arranged by the values of that variable. The following shows the ANIMAL and the PLANT input data sets:

 ANIMAL                     PLANT   OBS  Common  Animal        OBS  Common  Plant    1     a     Ant            1     a     Apple    2     b     Bird           2     b     Banana    3     c     Cat            3     c     Coconut    4     d     Dog            4     d     Dewberry    5     e     Eagle          5     e     Eggplant    6     f     Frog           6     g     Fig 

The following program uses two SET statements to combine observations from ANIMAL and PLANT, and prints the results:

 data twosets;      set animal;      set plant;   run;      proc print data=twosets;      title 'Data Set TWOSETS - Equal Number of Observations';   run; 
Output 23.6: Data Set Created from Two Data Sets That Have Equal Observations
start example
 Data Set TWOSETS - Equal Number of Observations                1                      Obs    Common    Animal    Plant                      1      a         Ant       Apple                      2      b         Bird      Banana                      3      c         Cat       Coconut                      4      d         Dog       Dewberry                      5      e         Eagle     Eggplant                      6      g         Frog      Fig 
end example
 

Each observation in the new data set contains all the variables from all the data sets. Note, however, that the Common variable value in observation 6 contains a "g." The value of Common in observation 6 of the ANIMAL data set was overwritten by the value in PLANT, which was the data set that SAS read last.

Comments and Comparisons

  • The results that are obtained by reading observations using two or more SET statements are similar to those that are obtained by using the MERGE statement with no BY statement. However, with one-to-one reading, SAS stops processing before all observations are read from all data sets if the number of observations in the data sets is not equal.

  • Using multiple SET statements with other DATA step statements makes the following applications possible:

    • merging one observation with many

    • conditionally merging observations

    • reading from the same data set twice.

One-to-One Merging

Definition

One-to-one merging combines observations from two or more SAS data sets into a single observation in a new data set. To perform a one-to-one merge, use the MERGE statement without a BY statement. SAS combines the first observation from all data sets in the MERGE statement into the first observation in the new data set, the second observation from all data sets into the second observation in the new data set, and so on. In a one-to-one merge, the number of observations in the new data set equals the number of observations in the largest data set that was named in the MERGE statement.

If you use the MERGENOBY= SAS system option, you can control whether SAS issues a message when MERGE processing occurs without an associated BY statement.

Syntax

Use this form of the MERGE statement to merge SAS data sets:

MERGE data-set(s) ;

where

data-set

names at least two existing SAS data sets.

Caution  

Avoid using duplicate values or different values of common variables. One-to-one merging with data sets that contain duplicate values of common variables can produce undesirable results. If a variable exists in more than one data set, the value from the last data set that is read is the one that is written to the new data set. The variables are combined exactly as they are read from each data set. Using a one-to-one merge to combine data sets with different values of common variables can also produce undesirable results. If a variable exists in more than one data set, the value from the last data set read is the one that is written to the new data set even if the value is missing. Once SAS has processed all observations in a data set, all subsequent observations in the new data set have missing values for the variables that are unique to that data set.

For a complete description of the MERGE statement, see SAS Language Reference: Dictionary .

DATA Step Processing During One-to-One Merging

Compilation phase

SAS reads the descriptor information of each data set that is named in the MERGE statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.

Execution - Step 1

SAS reads the first observation from each data set into the program data vector, reading the data sets in the order in which they appear in the MERGE statement. If two data sets contain the same variables, the values from the second data set replace the values from the first data set. After reading the first observation from the last data set and executing any other statements in the DATA step, SAS writes the contents of the program data vector to the new data set. Only those variables that are created or assigned values during the DATA step are set to missing.

Execution - Step 2

SAS continues until it has read all observations from all data sets.

Example 1: One-to-One Merging with an Equal Number of Observations

The SAS data sets ANIMAL and PLANT both contain the variable Common, and the observations are arranged by the values of Common. The following shows the ANIMAL and the PLANT input data sets:

 ANIMAL                     PLANT   OBS  Common  Animal        OBS  Common  Plant    1     a     Ant            1     a     Apple    2     b     Bird           2     b     Banana    3     c     Cat            3     c     Coconut    4     d     Dog            4     d     Dewberry    5     e     Eagle          5     e     Eggplant    6     f     Frog           6     g     Fig 

The following program merges these data sets and prints the results:

 data combined;      merge animal plant;   run;   proc print data=combined;      title 'Data Set COMBINED';   run; 
Output 23.7: Merged Data Sets That Have an Equal Number of Observations
start example
 Data Set COMBINED                     1                 Obs   Common    Animal    Plant                  1      a       Ant       Apple                  2      b       Bird      Banana                  3      c       Cat       Coconut                  4      d       Dog       Dewberry                  5      e       Eagle     Eggplant                  6      g       Frog      Fig 
end example
 

Each observation in the new data set contains all variables from all data sets. If two data sets contain the same variables, the values from the second data set replace the values from the first data set, as shown in observation 6.

Example 2: One-to-One Merging with an Unequal Number of Observations

The SAS data sets ANIMAL1 and PLANT1 both contain the variable Common, and the observations are arranged by the values of Common. The PLANT1 data set has fewer observations than the ANIMAL1 data set. The following shows the ANIMAL1 and the PLANT1 input data sets:

 ANIMAL1                     PLANT1   OBS  Common  Animal        OBS   Common  Plant    1     a     Ant            1      a     Apple    2     b     Bird           2      b     Banana    3     c     Cat            3      c     Coconut    4     d     Dog    5     e     Eagle    6     f     Frog 

The following program merges these unequal data sets and prints the results:

 data combined1;       merge animal1 plant1;   run;   proc print data=combined1;       title 'Data Set COMBINED1';   run; 
Output 23.8: Merged Data Sets That Have an Unequal Number of Observations
start example
 Data Set COMBINED1                         1                  Obs   Common    Animal     Plant                   1      a       Ant       Apple                   2      b       Bird      Banana                   3      c       Cat       Coconut                   4      d       Dog                   5      e       Eagle                   6      f       Frog 
end example
 

Note that observations 4 through 6 contain missing values for the variable Plant.

Example 3: One-to-One Merging with Duplicate Values of Common Variables

The following example shows the undesirable results that you can obtain by using one-to-one merging with data sets that contain duplicate values of common variables. The value from the last data set that is read is the one that is written to the new data set. The variables are combined exactly as they are read from each data set. In the following example, the data sets ANIMAL1 and PLANT1 contain the variable Common, and each data set contains observations with duplicate values of Common. The following shows the ANIMAL1 and the PLANT1 input data sets:

 ANIMAL1                     PLANT1   OBS  Common  Animal        OBS   Common  Plant    1     a     Ant            1      a     Apple    2     a     Ape            2      b     Banana    3     b     Bird           3      c     Coconut    4     c     Cat            4      c     Celery    5     d     Dog            5      d     Dewberry    6     e     Eagle          6      e     Eggplant 

The following program produces the data set MERGE1 data set and prints the results:

 /* This program illustrates undesirable results. */   data merge1;      merge animal1 plant1;   run;   proc print data=merge1;      title 'Data Set MERGE1';   run; 
Output 23.9: Undesirable Results with Duplicate Values of Common Variables
start example
 Data Set MERGE1                         1                Obs   Common    Animal1   Plant1                 1      a       Ant       Apple                 2      b       Ape       Banana                 3      c       Bird      Coconut                 4      c       Cat       Celery                 5      d       Dog       Dewberry                 6      e       Eagle     Eggplant 
end example
 

The number of observations in the new data set is six. Note that observations 2 and 3 contain undesirable values. SAS reads the second observation from data set ANIMAL1. It then reads the second observation from data set PLANT1 and replaces the values for the variables Common and Plant1. The third observation is created in the same way.

Example 4: One-to-One Merging with Different Values of Common Variables

The following example shows the undesirable results obtained from using the one-to-one merge to combine data sets with different values of common variables. If a variable exists in more than one data set, the value from the last data set that is read is the one that is written to the new data set even if the value is missing. Once SAS processes all observations in a data set, all subsequent observations in the new data set have missing values for the variables that are unique to that data set. In this example, the data sets ANIMAL2 and PLANT2 have different values of the Common variable. The following shows the ANIMAL2 and the PLANT2 input data sets:

 ANIMAL2                      PLANT2   OBS  Common  Animal          OBS  Common  Plant    1     a     Ant              1     a     Apple    2     c     Cat              2     b     Banana    3     d     Dog              3     c     Coconut    4     e     Eagle            4     e     Eggplant                                 5     f     Fig 

The following program produces the data set MERGE2 and prints the results:

 /* This program illustrates undesirable results. */   data merge2;      merge animal2 plant2;   run;   proc print data=merge2;      title 'Data Set MERGE2';   run; 
Output 23.10: Undesirable Results with Different Values of Common Variables
start example
 Data Set MERGE2                         1                     Obs   Common   Animal2   Plant2                      1      a      Ant       Apple                      2      b      Cat       Banana                      3      c      Dog       Coconut                      4      e      Eagle     Eggplant                      5      f                Fig 
end example
 

Comments and Comparisons

The results from a one-to-one merge are similar to the results obtained from using two or more SET statements to combine observations. However, with the one-to-one merge, SAS continues processing all observations in all data sets that were named in the MERGE statement.

Match-Merging

Definition

Match-merging combines observations from two or more SAS data sets into a single observation in a new data set according to the values of a common variable. The number of observations in the new data set is the sum of the largest number of observations in each BY group in all data sets. To perform a match-merge, use the MERGE statement with a BY statement. Before you can perform a match-merge, all data sets must be sorted by the variables that you specify in the BY statement or they must have an index.

Syntax

Use this form of the MERGE statement to match-merge data sets:

MERGE data-set(s) ;

BY variable(s) ;

where

data-set

names at least two existing SAS data sets from which observations are read.

variable

names each variable by which the data set is sorted or indexed. These variables are referred to as BY variables.

For a complete description of the MERGE and the BY statements, see SAS Language Reference: Dictionary .

DATA Step Processing During Match-Merging

Compilation phase

SAS reads the descriptor information of each data set that is named in the MERGE statement and then creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step. SAS creates the FIRST. variable and LAST. variable for each variable that is listed in the BY statement.

Execution - Step 1

SAS looks at the first BY group in each data set that is named in the MERGE statement to determine which BY group should appear first in the new data set. The DATA step reads into the program data vector the first observation in that BY group from each data set, reading the data sets in the order in which they appear in the MERGE statement. If a data set does not have observations in that BY group, the program data vector contains missing values for the variables unique to that data set.

Execution - Step 2

After processing the first observation from the last data set and executing other statements, SAS writes the contents of the program data vector to the new data set. SAS retains the values of all variables in the program data vector except those variables that were created by the DATA step; SAS sets those values to missing. SAS continues to merge observations until it writes all observations from the first BY group to the new data set. When SAS has read all observations in a BY group from all data sets, it sets all variables in the program data vector to missing. SAS looks at the next BY group in each data set to determine which BY group should appear next in the new data set.

Execution - Step 3

SAS repeats these steps until it reads all observations from all BY groups in all data sets.

Example 1: Combining Observations Based on a Criterion

The SAS data sets ANIMAL and PLANT each contain the BY variable Common, and the observations are arranged in order of the values of the BY variable. The following shows the ANIMAL and the PLANT input data sets:

 ANIMAL                     PLANT          OBS  Common  Animal      OBS Common  Plant           1     a     Ant          1    a     Apple           2     b     Bird         2    b     Banana           3     c     Cat          3    c     Coconut           4     d     Dog          4    d     Dewberry           5     e     Eagle        5    e     Eggplant           6     f     Frog         6    f     Fig 

The following program merges the data sets according to the values of the BY variable Common, and prints the results:

 data combined;       merge animal plant;       by Common;   run;   proc print data=combined;      title 'Data Set COMBINED';   run; 
Output 23.11: Data Sets Combined by Match-Merging
start example
 Data Set COMBINED                     1                     Obs   Common    Animal   Plant                      1      a       Ant      Apple                      2      b       Bird     Banana                      3      c       Cat      Coconut                      4      d       Dog      Dewberry                      5      e       Eagle    Eggplant                      6      f       Frog     Fig 
end example
 

Each observation in the new data set contains all the variables from all the data sets.

Example 2: Match-Merge with Duplicate Values of the BY Variable

When SAS reads the last observation from a BY group in one data set, SAS retains its values in the program data vector for all variables that are unique to that data set until all observations for that BY group have been read from all data sets. In the following example, the data sets ANIMAL1 and PLANT1 contain duplicate values of the BY variable Common. The following shows the ANIMAL1 and the PLANT1 input data sets:

 ANIMAL1                      PLANT1   OBS  Common  Animal1         OBS  Common  Plant1    1     a      Ant             1     a     Apple    2     a      Ape             2     b     Banana    3     b      Bird            3     c     Coconut    4     c      Cat             4     c     Celery    5     d      Dog             5     d     Dewberry    6     e      Eagle           6     e     Eggplant 

The following program produces the merged data set MATCH1, and prints the results:

 data match1;      merge animal1 plant1;      by Common;   run;   proc print data=match1;      title 'Data Set MATCH1';   run; 
Output 23.12: Match-Merged Data Set with Duplicate BY Values
start example
 Data Set MATCH1                     1                    Obs   Common    Animal1   Plant1                     1      a       Ant       Apple                     2      a       Ape       Apple                     3      b       Bird      Banana                     4      c       Cat       Coconut                     5      c       Cat       Celery                     6      d       Dog       Dewberry                     7      e       Eagle     Eggplant 
end example
 

In observation 2 of the output, the value of the variable Plant1 is retained until all observations in the BY group are written to the new data set. Match-merging also produced duplicate values in ANIMAL1 for observations 4 and 5.

Example 3: Match-Merge with Nonmatched Observations

When SAS performs a match-merge with nonmatched observations in the input data sets, SAS retains the values of all variables in the program data vector even if the value is missing. The data sets ANIMAL2 and PLANT2 do not contain all values of the BY variable Common. The following shows the ANIMAL2 and the PLANT2 input data sets:

 ANIMAL2                      PLANT2   OBS  Common  Animal2        OBS  Common  Plant2    1     a      Ant            1     a     Apple    2     c      Cat            2     b     Banana    3     d      Dog            3     c     Coconut    4     e      Eagle          4     e     Eggplant    5     f      Fig 

The following program produces the merged data set MATCH2, and prints the results:

 data match2;      merge animal2 plant2;      by Common;   run;   proc print data=match2;      title 'Data Set MATCH2';   run; 
Output 23.13: Match-Merged Data Set with Nonmatched Observations
start example
 Data Set MATCH2                         1                    Obs   Common    Animal2   Plant2                     1      a        Ant      Apple                     2      b                 Banana                     3      c        Cat      Coconut                     4      d        Dog                     5      e        Eagle    Eggplant                     6      f                 Fig 
end example
 

As the output shows, all values of the variable Common are represented in the new data set, including missing values for the variables that are in one data set but not in the other.

Updating with the UPDATE and the MODIFY Statements

Definitions

Updating a data set refers to the process of applying changes to a master data set. To update data sets, you work with two input data sets. The data set containing the original information is the master data set , and the data set containing the new information is the transaction data set .

You can update data sets by using the UPDATE statement or the MODIFY statement:

UPDATE

uses observations from the transaction data set to change the values of corresponding observations from the master data set. You must use a BY statement with the UPDATE statement because all observations in the transaction data set are keyed to observations in the master data set according to the values of the BY variable.

MODIFY

can replace, delete, and append observations in an existing data set. Using the MODIFY statement can save disk space because it modifies data in place, without creating a copy of the data set.

The number of observations in the new data set is the sum of the number of observations in the master data set and the number of unmatched observations in the transaction data set.

For complete information about the UPDATE and the MODIFY statements, see SAS Language Reference: Dictionary .

Syntax of the UPDATE Statement

Use this form of the UPDATE statement to update a master data set:

UPDATE master-data-set transaction-data-set ;

BY variable-list ;

where

master-data-set

names the SAS data set that is used as the master file.

transaction-data-set

names the SAS data set that contains the changes to be applied to the master data set.

variable-list

specifies the variables by which observations are matched.

If the transaction data set contains duplicate values of the BY variable, SAS applies both transactions to the observation. The last values that are copied into the program data vector are written to the new data set. If your data is in this form, use the MODIFY statement instead of the UPDATE statement to process your data.

Caution  

Values of the BY variable must be unique for each observation in the master data set. If the master data set contains two observations with the same value of the BY variable, the first observation is updated and the second observation is ignored. SAS writes a warning message to the log when the DATA step executes.

For complete information about the UPDATE statement, see SAS Language Reference: Dictionary .

Syntax of the MODIFY Statement

This form of the MODIFY statement is used in the examples that follow:

MODIFY master-data-set ;

BY variable-list ;

where

master-data-set

specifies the SAS data set that you want to modify.

variable-list

names each variable by which the data set is ordered.

Note  

The MODIFY statement does not support changing the descriptor portion of a SAS data set, such as adding a variable.

For complete information about the MODIFY statement, see SAS Language Reference: Dictionary .

DATA Step Processing with the UPDATE Statement

Compilation phase

  • SAS reads the descriptor information of each data set that is named in the UPDATE statement and creates a program data vector that contains all the variables from all data sets as well as variables created by the DATA step.

  • SAS creates the FIRST. variable and LAST. variable for each variable that is listed in the BY statement.

Execution - Step 1

SAS looks at the first observation in each data set that is named in the UPDATE statement to determine which BY group should appear first. If the transaction BY value precedes the master BY value, SAS reads from the transaction data set only and sets the variables from the master data set to missing. If the master BY value precedes the transaction BY value, SAS reads from the master data set only and sets the unique variables from the transaction data set to missing. If the BY values in the master and transaction data sets are equal, it applies the first transaction by copying the nonmissing values into the program data vector.

Execution - Step 2

After completing the first transaction, SAS looks at the next observation in the transaction data set. If SAS finds one with the same BY value, it applies that transaction too. The first observation then contains the new values from both transactions. If no other transactions exist for that observation, SAS writes the observation to the new data set and sets the values in the program data vector to missing. SAS repeats these steps until it has read all observations from all BY groups in both data sets.

Updating with Nonmatched Observations, Missing Values, and New Variables

In the UPDATE statement, if an observation in the master data set does not have a corresponding observation in the transaction data set, SAS writes the observation to the new data set without modifying it. Any observation from the transaction data set that does not correspond to an observation in the master data set is written to the program data vector and becomes the basis for an observation in the new data set. The data in the program data vector can be modified by other transactions before it is written to the new data set. If a master data set observation does not need updating, the corresponding observation can be omitted from the transaction data set.

SAS does not replace existing values in the master data set with missing values if those values are coded as periods (for numeric variables) or blanks (for character variables) in the transaction data set. To replace existing values with missing values, you must either create a transaction data set in which missing values are coded with the special missing value characters , or use the UPDATEMODE=NOMISSINGCHECK statement option.

With UPDATE, the transaction data set can contain new variables to be added to all observations in the master data set.

To view a sample program, see "Example 3: Using UPDATE for Processing Nonmatched Observations, Missing Values, and New Variables" on page 425.

Sort Requirements for the UPDATE Statement

If you do not use an index, both the master data set and the transaction data set must be sorted by the same variable or variables that you specify in the BY statement that accompanies the UPDATE statement. The values of the BY variable should be unique for each observation in the master data set. If you use more than one BY variable, the combination of values of all BY variables should be unique for each observation in the master data set. The BY variable or variables should be ones that you never need to update.

Note  

The MODIFY statement does not require sorted files. However, sorting the data improves efficiency.

Using an Index with the MODIFY Statement

The MODIFY statement maintains the index. You do not have to rebuild the index like you do for the UPDATE statement.

Choosing between UPDATE or MODIFY with BY

Using the UPDATE statement is comparable to using MODIFY with BY to apply transactions to a data set. While MODIFY is a more powerful tool with several other applications, UPDATE is still the tool of choice in some cases. The following table helps you choose whether to use UPDATE or MODIFY with BY.

Table 23.3: MODIFY with BY versus UPDATE

Issue

MODIFY with BY

UPDATE

Disk space

saves disk space because it updates data in place

requires more disk space because it produces an updated copy of the data set

Sort and index

sorted input data sets are not required, although for good performance, it is strongly recommended that both data sets be sorted and that the master data set be indexed

requires only that both data sets be sorted

When to use

use only when you expect to process a SMALL portion of the data set

use if you expect to need to process most of the data set

Where to specify the modified data set

specify the updated data set in both the DATA and the MODIFY statements

specify the updated data set in the DATA and the UPDATE statements

Duplicate BY-values

allows duplicate BY-values in both the master and the transaction data sets

allows duplicate BY-values in the transaction data set only (If duplicates exist in the master data set, SAS issues a warning.)

Scope of changes

cannot change the data set descriptor information, so changes such as adding or deleting variables, variable labels, and so on, are not valid

can make changes that require a change in the descriptor portion of a data set, such as adding new variables, and so on

Error checking

has error-checking capabilities using the _IORC_ automatic variable and the SYSRC autocall macro

needs no error checking because transactions without a corresponding master record are not applied but are added to the data set

Data set integrity

data may only be partially updated due to an abnormal task termination

no data loss occurs because UPDATE works on a copy of the data

For more information about tools for combining SAS data sets, see Table 23.2 on page 399.

Primary Uses of the MODIFY Statement

The MODIFY statement has three primary uses:

  • modifying observations in a single SAS data set.

  • modifying observations in a single SAS data set directly, either by observation number or by values in an index.

  • modifying observations in a master data set, based on values in a transaction data set. MODIFY with BY is similar to using the UPDATE statement.

Several of the examples that follow demonstrate these uses.

Example 1: Using UPDATE for Basic Updating

In this example, the data set MASTER contains original values of the variables Animal and Plant. The data set NEWPLANT is a transaction data set with new values of the variable Plant. The following shows the MASTER and the NEWPLANT input data sets:

 MASTER                          NEWPLANT   OBS Common Animal Plant          OBS Common Plant    1    a    Ant    Apple           1    a    Apricot    2    b    Bird   Banana          2    b    Barley    3    c    Cat    Coconut         3    c    Cactus    4    d    Dog    Dewberry        4    d    Date    5    e    Eagle  Eggplant        5    e    Escarole    6    f    Frog   Fig             6    f    Fennel 

The following program updates MASTER with the transactions in the data set NEWPLANT, writes the results to UPDATE_FILE, and prints the results:

 data update_file;      update master newplant;      by common;   run;   proc print data=update_file;      title 'Data Set Update_File';   run; 
Output 23.14: Master Data Set Updated by Transaction Data Set
start example
 Data Set Update_File                         1                     Obs    Common    Animal    Plant                      1       a       Ant       Apricot                      2       b       Bird      Barley                      3       c       Cat       Cactus                      4       d       Dog       Date                      5       e       Eagle     Escarole                      6       f       Frog      Fennel 
end example
 

Each observation in the new data set contains a new value for the variable Plant.

Example 2: Using UPDATE with Duplicate Values of the BY Variable

If the master data set contains two observations with the same value of the BY variable, the first observation is updated and the second observation is ignored. SAS writes a warning message to the log. If the transaction data set contains duplicate values of the BY variable, SAS applies both transactions to the observation. The last values copied into the program data vector are written to the new data set. The following shows the MASTER1 and the DUPPLANT input data sets.

 MASTER1                           DUPPLANT   OBS Common Animal1 Plant1           OBS Common Plant1    1    a    Ant     Apple            1    a     Apricot    2    b    Bird    Banana           2    b     Barley    3    b    Bird    Banana           3    c     Cactus    4    c    Cat     Coconut          4    d     Date    5    d    Dog     Dewberry         5    d     Dill    6    e    Eagle   Eggplant         6    e     Escarole    7    f    Frog    Fig              7    f     Fennel 

The following program applies the transactions in DUPPLANT to MASTER1 and prints the results:

 data update1;      update master1 dupplant;      by Common;   run;   proc print data=update1;      title 'Data Set Update1';   run; 
Output 23.15: Updating Data Sets with Duplicate BY Values
start example
 Data Set Update1                         1                       Obs    Common    Animal1    Plant1                        1       a       Ant        Apricot                        2       b       Bird       Barley                        3       b       Bird       Banana                        4       c       Cat        Cactus                        5       d       Dog        Dill                        6       e       Eagle      Escarole                        7       f       Frog       Fennel 
end example
 

When this DATA step executes, SAS generates a warning message stating that there is more than one observation for a BY group. However, the DATA step continues to process, and the data set UPDATE1 is created.

The resulting data set has seven observations. Observations 2 and 3 have duplicate values of the BY variable Common. However, the value of the variable PLANT1 was not updated in the second occurrence of the duplicate BY value.

Example 3: Using UPDATE for Processing Nonmatched Observations, Missing Values, and New Variables

In this example, the data set MASTER2 is a master data set. It contains a missing value for the variable Plant2 in the first observation, and not all of the values of the BY variable Common are included. The transaction data set NONPLANT contains a new variable Mineral, a new value of the BY variable Common, and missing values for several observations. The following shows the MASTER2 and the NONPLANT input data sets:

 MASTER2                              NONPLANT   OBS  Common  Animal2  Plant2         OBS  Common  Plant2   Mineral    1     a     Ant                      1     a     Apricot  Amethyst    2     c     Cat      Coconut         2     b     Barley   Beryl    3     d     Dog      Dewberry        3     c     Cactus    4     e     Eagle    Eggplant        4     e    5     f     Frog     Fig             5     f     Fennel                                         6     g     Grape    Garnet 

The following program updates the data set MASTER2 and prints the results:

 data update2_file;      update master2 nonplant;      by Common;   run;   proc print data=update2_file;      title 'Data Set Update2_File';   run; 
Output 23.16: Results of Updating with New Variables, Nonmatched Observations, and Missing Values
start example
 Data Set Update2_File                         1                    Obs    Common    Animal2    Plant2    Mineral                     1       a        Ant       Apricot   Amethyst                     2       b                  Barley    Beryl                     3       c        Cat       Cactus                     4       d        Dog       Dewberry                     5       e        Eagle     Eggplant                     6       f        Frog      Fennel                     7       g                  Grape     Garnet 
end example
 

As shown, all observations now include values for the variable Mineral. The value of Mineral is set to missing for some observations. Observations 2 and 6 in the transaction data set did not have corresponding observations in MASTER2, and they have become new observations. Observation 3 from the master data set was written to the new data set without change, and the value for Plant2 in observation 4 was not changed to missing. Three observations in the new data set have updated values for the variable Plant2.

The following program uses the UPDATEMODE statement option on the UPDATE statement, and prints the results:

 data update2_file;      update master2 nonplant updatemode=nomissingcheck;      by Common;   run;   proc print data=update2_file;      title 'Data Set Update2_File - UPDATEMODE Option';   run; 
Output 23.17: Results of Updating with the UPDATEMODE Option
start example
 Data Set Update2_File - UPDATEMODE Option                 1                   Obs   Common    Animal2    Plant2      Mineral                    1      a        Ant       Apricot     Amethyst                    2      b                  Barley      Beryl                    3      c        Cat       Cactus                    4      d        Dog       Dewberry                    5      e        Eagle                    6      f        Frog      Fennel                    7      g                  Grape       Garnet 
end example
 

The value of Plant2 in observation 5 is set to missing because the UPDATEMODE=NOMISSINGCHECK option is in effect.

For detailed examples for updating data sets, see Combining and Modifying SAS Data Sets: Examples .

Example 4: Updating a MASTER Data Set by Adding an Observation

If the transaction data set contains an observation that does not match an observation in the master data set, you must alter the program. The Year value in observation 5 of TRANSACTION has no match in MASTER. The following shows the MASTER and the TRANSACTION input data sets:

 MASTER                       TRANSACTION   OBS  Year   VarX   VarY       OBS  Year   VarX   VarY    1   1985    x1     y1         1   1991    x2    2   1986    x1     y1         2   1992    x2     y2    3   1987    x1     y1         3   1993    x2    4   1988    x1     y1         4   1993           y2    5   1989    x1     y1         5   1995    x2     y2    6   1990    x1     y1    7   1991    x1     y1    8   1992    x1     y1    9   1993    x1     y1   10   1994    x1     y1 

You must use an explicit OUTPUT statement to write a new observation to a master data set. (The default action for a DATA step using a MODIFY statement is REPLACE, not OUTPUT.) Once you specify an explicit OUTPUT statement, you must also specify a REPLACE statement. The following DATA step updates data set MASTER, based on values in TRANSACTION, and adds a new observation. This program also uses the _IORC_ automatic variable for error checking. (For more information about error checking, see "Error Checking When Using Indexes to Randomly Access or Update Data" on page 428.

Output 23.18: Modified MASTER Data Set
start example
 Updated Master Data Set -- MODIFY                     1                           One Observation Added                        Obs    Year    VarX    VarY                          1    1985     x1      y1                          2    1986     x1      y1                          3    1987     x1      y1                          4    1988     x1      y1                          5    1989     x1      y1                          6    1990     x1      y1                          7    1991     x2      y1                          8    1992     x2      y2                          9    1993     x2      y2                         10    1994     x1      y1                         11    1995     x2      y2 
end example
 

SAS added a new observation, observation 11, to the MASTER data set and updated observations 7, 8, and 9.




SAS 9.1.3 Language Reference. Concepts
SAS 9.1.3 Language Reference: Concepts, Third Edition, Volumes 1 and 2
ISBN: 1590478401
EAN: 2147483647
Year: 2004
Pages: 258

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net